Poisoning for propaganda: rising authoritarianism makes LLMs more dangerous

3 February 2025 – Baldur Bjarnason

I’m still working on the newsletter reboot but felt the following was important enough to note. Think of it as a Public Service Announcement of sorts.

I’d like to reiterate what I said a while back: integrating LLM-based tools – chatbots, copilots, agents, etc. – into all corporate and personal workflows is outright dangerous. Even when run locally, most LLMs in use are trained and tuned by corporations that are now deeply in bed with a lawless authoritarian takeover of the US.

People who are removing all references to minorities, women, and equality from your public spheres will not hesitate to ask corporations to tune centrally-controlled LLMs to censor the same from your work.

“But I’d notice if the LLM started censoring my work!”

Really? Did you notice this?

As some people already mentioned here or here, Copilot purposely stops working on code that contains hardcoded banned words from GitHub, such as gender or sex.

Copilot stops working on gender related subjects #72603

The point of cognitive automation is NOT to enhance thinking. The point of it is to avoid thinking in the first place. That’s the job it does. You won’t notice when the censorship kicks in.

“This isn’t going to happen. These are profit-motivated companies who are trying to sell LLMs as a productivity miracle. They won’t compromise the LLMs because that would make them less productive.”

Really?

From the same discussion I linked to above.

I have trans_time all over my code and CoPilot refuses to talk about it.

Open weight models do not solve this problem either as regular users and businesses are not going to be hand-tuning and running their own models. They are going to access them through services and software products and those will largely be controlled by organisations that have a similarly cosy relationship with the US administration as the ones selling closed-weight models. We also do not have any assurances about the actual safety or security of the open-weight models themselves. There’s no reason to believe that Meta’s open-weight models aren’t going to be following the US administration’s policies nor is it plausible that DeepSeek is somehow not going to follow the precepts of the Chinese Communist Party.

The truth about modern AI is that there is that every major “AI” company today is in bed with an authoritarian government. The ones that aren’t – such as the European ones – are distant runners up to the US or Chinese models.

They are all open to direct – keyword-based – censorship.

But the actual impact is likely to be more subtle and insidious than flat out censorship.

The censorship approach, such as that applied by GitHub above on gender topics and by DeepSeek on topics that the Chinese government disapproves of, is the simplest form of censorship you can apply to an LLM-based system. It doesn’t require any alteration of any part of the actual model and is instead applied by filtering the prompt (or possibly at the tokenisation stage).

This is an effective approach when your concern is legal liability. It lets you shut the model down if it ventures into a topic that could lead your companies to suffer reprimands or fines by the state. It’s an approach that makes the most sense if you are agnostic about the censorship itself.

But if you are a willing participant in authoritarianism – which seems to be the case for Google, Apple, Meta, OpenAI, and Microsoft – there are subtler and more effective methods for altering a model’s output to suit your ideology.

I’ve described this before as giving “a handful of CEOs a racism and bigotry dial for the world’s English-language corporate writing.”

The alternative approach to censorship, fine-tuning the model to return a specific response, is more costly than keyword blocking and more error-prone. And resorting to prompt manipulation or preambles is somewhat easily bypassed but, crucially, you need to know that there is something to bypass (or “jailbreak”) in the first place.

A more concerning approach, in my view, is poisoning.

At both the training and fine-tuning stages of a language model, you only need a small number of purpose-chosen token streams to “poison” the model for a given keyword. You can design this poisoning to shift the sentiment of the model’s response whenever that keyword appears in a prompt without resorting to heavy-handed tactics such as blocking the reply entirely.

That is, instead of not responding when the word “trans” appears in the prompt, it can be designed to always respond in a way that casts the word in a bad light.

In effect: propaganda.

I’ve written about model poisoning before:

Since those two essays were published all major AI vendors seem to have given up on preventing the attack and are instead throwing ever increasing numbers of poorly vetted documents into their training data.

There doesn’t seem to be a meaningful limit to how many keywords could be manipulated this way. Certainly most of the current US administration’s bugbears could be covered without making the models any more useless than they already are.

The reason why I think that poisoning will become the ideological propaganda tool of choice in the long term is that, unlike prompt preambles or keyword banning, you can’t easily test for sentiment manipulation. That a model might return a negative-sounding response to every query featuring “feminism” or “gay” is not a smoking gun as without access to the training data set itself, it’s impossible to be sure that it isn’t just a bias inherent in the data set.

Poisoning for propaganda has built-in plausible deniability and, as you can see from how the media is covering current events in the US, that’s all they need to carry on.

Open weight models, especially, seem likely targets as they have every economic incentive to cut corners. Even the training data sets seem to be insecure as a common tactic for circumventing privacy regulations is to not actually store the training data itself, only URLs, which means that the documents are fetched again every time it’s set up. Hashes or checksums are infeasible to ensure the consistency of dynamic web pages, which means that it’s trivial for bad actors (especially state actors) to take over chunks of the training data set and use it to manipulate the resulting model.

It’s entirely likely that many open weight models have been compromised without anybody involved in the project realising it.

And when the vendor who is doing the training is untrustworthy in the first place, such as Meta or DeepSeek, it’s only sensible to assume that the model has been compromised until proven otherwise.

Integrating LLMs – whether it’s a chatbot, copywriting or proofreading tool, or copilot – into your work or business processes is effectively giving a biased organisation an “ideology” dial for your writing and messaging. They might not have touched the dial up until now (though, I have my doubts, see what I wrote above about compromised models) but you have no assurances that they won’t grab the dial and tune your writing, your marketing, your emails into outright propaganda.

Even local models provided by your OS aren’t safe because they can and will be changed in an OS update.

Previous entry

First outlines of a plan: thinking strategically about a modern tech and media business

24 January 2025
Next entry

AI and Esoteric Fascism

27 February 2025

Poisoning for propaganda: rising authoritarianism makes LLMs more dangerous

Join the Newsletter