GDPR and American AIs
Italian regulators turn their attention towards OpenAI and ChatGPT #
If you’ve been following AI discussions on social media, you might have heard that the Italian privacy regulator banned ChatGPT in Italy. This has led to jokes that the ban is retaliation for ChatGPT recommending you snap spaghetti in half to make it easier to boil, or that you put pineapple on your pizza.
It’s also led to the usual accusations of regulators being against progress and hating cool things. (This is the sort of attitude that’s the reason why the US still hasn’t banned asbestos.)
Instead of just reading yet another pundit spouting an opinion based on somebody’s response to what somebody else thinks the ban might actually be about, I decided to go and read the complaint itself. (Scroll down for the English version.)
It’s relatively straightforward, if you have some familiarity with the GDPR.
(So, not straightforward at all, honestly.)
The GDPR is the EU’s data privacy regulation. It harmonises data privacy laws across the EU, and it has pretty stiff fines for violations. From the Wikipedia page: “€20 million or up to 4% of the annual worldwide turnover of the preceding financial year in case of an enterprise, whichever is greater.” (Emphasis mine.)
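To make that “whichever is greater” concrete, here’s a minimal sketch in Python. The function name and the turnover figures are mine, purely for illustration; the thresholds are the ones quoted above.

```python
# Illustrative only. The GDPR's higher fine tier is the *greater* of a
# flat EUR 20 million or 4% of annual worldwide turnover.
def max_gdpr_fine(annual_worldwide_turnover_eur: float) -> float:
    """Upper bound of the GDPR's higher fine tier, in euros."""
    return max(20_000_000, 0.04 * annual_worldwide_turnover_eur)

# A company with EUR 1 billion in turnover: the 4% rule dominates.
print(max_gdpr_fine(1_000_000_000))  # 40000000.0, i.e. EUR 40 million
# A company with EUR 100 million in turnover: the flat floor applies.
print(max_gdpr_fine(100_000_000))    # 20000000.0, i.e. EUR 20 million
```

That “whichever is greater” clause is the point: the cap scales with the size of the company, so big firms can’t treat it as a rounding error.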
One way the GDPR limits the abuses of data collection is that data can only be collected with consent and used for a specific, legitimate purpose. This is to prevent companies from just hoovering up data and then using it as a general-purpose building block for products, analytics, and surveillance. If you collect data, it must be for a specific purpose.
This post has more details on the implications this has for language models. It’s from a couple of months ago, so any competent tech co should have known this was coming. The primary complaint is that OpenAI is collecting personal data and using it to train the model. To be more specific, OpenAI is collecting pretty much the entire internet, which will inevitably contain personal data, and training on that.
Since the model is a general-purpose language model, there is no way for it to enforce the purpose limitation the GDPR requires. It is, by design, general-purpose: a foundation intended for other tools to build on.
Even if OpenAI somehow got around the purpose limitation, they don’t have consent from the people the data is about.
Even if they did get that, for example by arguing that by posting the data publicly its owner gave implicit consent, OpenAI doesn’t support the right to erasure: you can’t ask it to delete all your personal data. Erasure is essential to the implicit-consent defence, because you need to be able to delete data that was accidentally or maliciously made public. Machine ‘unlearning’ hasn’t caught up with regular machine learning and, as far as I can tell, nobody has got it working properly on a system the size of GPT-3 or GPT-4. So, even without any other issues, just judging from the inclusion of crawled websites in the training data, it looks like OpenAI is indeed breaking the GDPR.
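To illustrate why that matters: absent working unlearning, the only reliable way to honour an erasure request is to filter the training corpus and retrain from scratch. A rough sketch of the brute-force approach (all names here are hypothetical, not anyone’s actual API):

```python
# Hypothetical sketch: why the right to erasure is hard for trained models.
# Deleting a record from the dataset is the easy part; the trained weights
# still encode it. Without working machine unlearning, the only reliable
# fix is to drop the data and retrain the whole model.
def erase_and_retrain(corpus, mentions_data_subject, train):
    """Honour one erasure request the brute-force way."""
    filtered = [doc for doc in corpus if not mentions_data_subject(doc)]
    # For a model the size of GPT-3 or GPT-4, this retraining step means
    # weeks of compute and millions of dollars -- per erasure request.
    return train(filtered)
```

Which is why nobody does it, and why the models can’t comply.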
But, additionally, OpenAI was training on user data by default until a month ago, and deleting that user data is impossible: both facts they have admitted themselves, and both are clear violations of the GDPR.
It certainly looks like OpenAI is pretty unambiguously in the wrong here. But, more importantly for tech’s AI aspirations, it looks like the same applies to every other foundational model out there.
If the Italian regulator is right, and it looks like they are, then this generation of large language models just might not be compatible with the GDPR.
Almost as important as the primary complaint are the data breaches. OpenAI has had a bunch of data breaches lately that it handled poorly, and the regulator cites them in its notice. For a service of that size, not properly reporting breaches to both the regulator and the affected users is a serious GDPR violation; companies have been fined for it in the past.
The thing about having 100 million users is that you get the regulation that comes with them. I have no idea how OpenAI are going to handle this, but the complaint seems valid enough.
And because, unlike the big tech cos, OpenAI doesn’t have an establishment in the EU, and so has no lead supervisory data protection authority, every member state’s regulator has jurisdiction: “Controllers without any establishment in the EU must deal with local supervisory authorities in every Member State they are active in.”
Italy might just be the first and, unfortunately for OpenAI, every single regulatory body has the power to fine. They could be facing multiple fines from multiple countries.
People have been wondering how on earth these models were supposed to comply with the GDPR for months now.
The answer seems to be that they aren’t.
More links on the GDPR and model privacy #
- More on the GDPR reporting requirement.
- The 100 million user thing, which is gonna get you some heavy-handed regulation, ’cause that’s how the law is supposed to work.
- More on the data privacy issues with ChatGPT, which are so bad they might even get it dinged in non-GDPR jurisdictions.
- One reason machine unlearning is tricky for a service of that size is that it can be an exploit vector.
The AI is an American #
Years ago, I had a go at explaining to somebody that AI colourisation inevitably erases variation and minorities from history.
AI-generated images are that ×1000. Everything becomes American.
The other AI links #
- “AI as centralizing and distancing technology”. This touches on one of my concerns with how Microsoft and Google are proposing to use AI: putting AI between people, separating them in the name of ‘productivity’.
- “SoK: On the Impossible Security of Very Large Foundation Models”. I’ve only had a quick read of this preprint but it manages to both pull together many of the issues with large language models I’ve seen raised in other papers and give them a solid, reasoned foundation.
- “※ ChatGPT Survey: Performance on NLP datasets”. Related to what I noted recently about these large foundational models being technically flawed. Turns out ChatGPT isn’t actually that good at natural language tasks compared to simpler models.
- ‘Statement from the listed authors of Stochastic Parrots on the “AI pause” letter’.
- “AI is going to make teaching worse, but not in the way everyone thinks - Charles Kenneth Roberts”.
- “Buzzfeed Has Begun Publishing Articles Generated by A.I. — Pixel Envy”. Predictably, the articles are even worse than Buzzfeed’s usual.
- “The Impact of AI on Developer Productivity: Evidence from GitHub Copilot”. I can spot at least four serious flaws in this at a glance. But all you need to know is that most of the authors work for Microsoft or GitHub.
- “Manton Reece - Introducing Micro.blog podcast transcripts”. One of the good things to come out of the current AI bubble is improved automatic transcripts.
- “The problem with artificial intelligence? It’s not artificial or intelligent”. “The ultimate risk of not retiring terms such as ‘artificial intelligence’ is that they will render the creative work of intelligence invisible, while making the world more predictable and dumb.”
- “Neither artificial, nor intelligent - hidde.blog”.
- “Policy makers: Please don’t fall for the distractions of #AIhype - by Emily M. Bender - Mar, 2023 - Medium”
- “Code, not Chat, in Generative AI”. I think the seeming effectiveness of AI-assisted coding is leading many coders to assume it’s as useful in other jobs, which is a big mistake. (Note the ‘seeming’ in that sentence.)
Software Development Links #
- “The web we broke. — Ethan Marcotte”
- “The Most Dangerous Codec in the World: Finding and Exploiting Vulnerabilities in H.264 Decoders”. This is pretty bad.
- “Types in JavaScript With Zod and JSDoc - Jim Nielsen’s Blog”. Definitely an approach I’d like to try out in my projects.
- “JavaScript import maps are now supported cross-browser”. Now all we need are module workers in Firefox.
- “Defaulting on Single Page Applications (SPA)—zachleat.com”
- "Ship Small, Ship Fast"