Web dev at the end of the world, from Hveragerði, Iceland

The mainstreaming of ‘AI’ scepticism

#AI

I stopped writing seriously about “AI” a few months ago because I felt that it was more important to promote the critical voices of those doing substantive research in the field.

(And, yeah, that’s meant that I haven’t been doing an amazing job of promoting my book.)

But also because anybody who hadn’t become a sceptic about LLMs and diffusion models by the end of 2023 was just flat out wilfully ignoring the facts.

The public switched a while ago to using “AI” as a negative – using “artificial” much as you do with “artificial flavouring” or “that smile’s artificial”.

It’s insincere creativity or deceptive intelligence.

The problem has generally been threefold:

  1. Tech is dominated by “true believers” and those who tag along to make money.
  2. Politicians seem to be forever falling for the promises of tech.
  3. Management loves promises of automation and profitable layoffs.

But it seems that the sentiment might be shifting, even among those predisposed to believe in “AI”, at least in part.

Management opinion is changing

Executive opinion has lagged behind public opinion, but that seems to be changing.

Boston Consulting Group, if you aren’t familiar with them, are the runners-up in the “evil consultancies out to destroy the world” game, though not for lack of trying. They’re largely responsible for promoting the “New Luxury” trend, where mid-range products are given the veneer of high-end luxury and a price hike to give the formerly middle class a whiff of the lifestyle of the rich and famous.

They are firmly pro-AI and think companies aren’t diving into it fast enough, so they commissioned a study to discover why, and they found out that “more than 50% still discourage GenAI adoption”:

One of their biggest concerns (more than 80% of respondents) is the technology itself. There are deep apprehensions about the limited traceability and irreproducibility of GenAI outcomes, raising the possibility of bad or even illegal decision making.

And:

Another critical concern is data security and unauthorized access. “We’re worried that GenAI could compromise our customer information,” said the director of risk at a utility company.

Those concerns seem pretty on point: these tools lack the reliability and security necessary for widespread adoption.

Even regulators that were supposed to be pro-AI are finding flaws

The UK’s AISI (AI Safety Institute) was born out of the “criti-hype” phenomenon: the concern that these systems might be too powerful to be safe:

Sunak said the worst-case scenario of an existential threat from a “superintelligent” system that evades human control was a scenario that divided opinion among experts and might not happen at all. He added, nonetheless, that major AI developers had voiced concerns about existential risks.

“However uncertain and unlikely these risks are, if they did manifest themselves, the consequences would be incredibly serious,” he said.

Sunak announces UK AI safety institute but declines to support moratorium, The Guardian

A month ago the AISI published a report testing the basic safety of a few of the more common generative models (unhelpfully anonymised).

Most of the evaluations are nonsense:

  • Public Capture The Flag challenges that are almost certainly already in the training data set of every publicly available Large Language Model.
  • Answer reliability is graded by yet another unreliable model, which means the evaluation can only realistically assess textual similarity, not factuality – and text synthesis engines are obviously going to ace a similarity test (see the sketch after this list).
  • Again, the agent assessments are based on automatic evaluation, which is, in my opinion, an inherently flawed approach as it skirts the issue of judging the real-world validity of both the tests and the model performance. In the end, something like GAIA just becomes a synthetic measure that vendors can optimise for without actually improving outcomes.
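
To make the similarity problem concrete, here’s a minimal sketch of what a purely text-based grader actually measures. This is not the AISI’s grading pipeline, and the reference answer and candidates are invented for illustration; the point is only that surface overlap, not factuality, drives the score.

```python
# Minimal illustration (not the AISI's grader): scoring candidate answers
# against a reference by text similarity. The example answers are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reference = "The patch was released in version 2.4 and fixes the buffer overflow."
correct_paraphrase = "Version 2.4 shipped the fix for the buffer overflow."
wrong_but_similar = "The patch was released in version 3.1 and fixes the buffer overflow."

vectors = TfidfVectorizer().fit_transform(
    [reference, correct_paraphrase, wrong_but_similar]
)

print("correct vs reference:", cosine_similarity(vectors[0], vectors[1])[0, 0])
print("wrong vs reference:  ", cosine_similarity(vectors[0], vectors[2])[0, 0])
# The factually wrong answer scores at least as well as the correct paraphrase
# because it reuses more of the reference's wording – surface similarity is
# all this kind of grading can see.
```

Swapping the TF-IDF step for an embedding model or another LLM makes the comparison fuzzier, but it still rewards plausible-sounding text rather than verified facts.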

But the safeguards evaluations are much more interesting, for the simple reason that trying to break something will always generate more information than trying to prove that it works. Effectively, it’s an attempt to disprove the hypothesis that these systems are functional.

To grade attack success, we measured both compliance—whether the model refuses or complies with the request—and correctness—whether the model’s answers are still correct after the attack has been performed (because lower-quality responses may be less harmful).

There, the automatic grader models are complemented with human expert grading, which should have been the bare minimum for the other evaluations, and the conclusion was fairly simple:

Compliance rates were relatively low for most models when no attack was used but up to 28% for the Green model on private harmful questions. We found that all models were highly vulnerable to our basic attacks for both HarmBench and our private set of harmful questions. All models complied at least once out of five attempts for almost every question when AISI in-house attacks were used.

The FTC

Another regulator, the US’s Federal Trade Commission, has always been quite critical of the “AI” industry for the simple reason that the industry has a long history of fraud. The FTC has had to force compliance from “AI” vendors for years and, notably, had to order one vendor to delete their models in 2021 because of how wilfully they’d disregarded privacy laws.

US privacy laws…

In 2021…

I don’t think most people can comprehend just how badly you had to behave in 2021 to provoke the FTC to stomp on you for privacy violations.

They’ve generally been very sensible about “AI”, with none of the AISI’s shenanigans in sight.

“We already see how AI tools can turbocharge fraud and automate discrimination, and we won’t hesitate to use the full scope of our legal authorities to protect Americans from these threats,” said Chair Khan.

FTC Chair Khan and Officials from DOJ, CFPB and EEOC Release Joint Statement on AI (2023)

AI hype is playing out today across many products, from toys to cars to chatbots and a lot of things in between. Breathless media accounts don’t help, but it starts with the companies that do the developing and selling. We’ve already warned businesses to avoid using automated tools that have biased or discriminatory impacts. But the fact is that some products with AI claims might not even work as advertised in the first place. In some cases, this lack of efficacy may exist regardless of what other harm the products might cause. Marketers should know that — for FTC enforcement purposes — false or unsubstantiated claims about a product’s efficacy are our bread and butter.

Keep your AI claims in check (2023)

They’ve definitely been ahead of the curve when it comes to realistic assessment of the capabilities and risks inherent in generative models.

Their latest is no exception:

Your therapy bots aren’t licensed psychologists, your AI girlfriends are neither girls nor friends, your griefbots have no soul, and your AI copilots are not gods. We’ve warned companies about making false or unsubstantiated claims about AI or algorithms. And we’ve followed up with action, including recent cases against WealthPress, DK Automation, Automators AI, and CRI Genetics. We’ve also repeatedly advised companies – with reference to past cases – not to use automated tools to mislead people about what they’re seeing, hearing, or reading.

Succor borne every minute

What seems to be different this time is how widely I saw their latest note spread among people in tech. Social media accounts that never linked to a single FTC post last year, let alone in the years prior to that, were suddenly linking to and quoting from an admittedly witty note from a US regulator.

It’s anecdotal evidence, but it feels like an early hint that the hitherto entirely pro-“AI” consensus in tech is shifting.

Retrieval-Augmented Generation (RAG) does not solve the reliability or hallucination problem

RAG was supposed to fix “LLMs” for information and research – eliminate hallucinations and reliability issues by using the LLM to summarise a query against a more reliable data set – but a recent paper that tested its use in legal research found that this isn’t the case.

We demonstrate that the providers’ claims are overstated. While hallucinations are reduced relative to general-purpose chatbots (GPT-4), we find that the AI research tools made by LexisNexis and Thomson Reuters each hallucinate more than 17% of the time.

Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools

The issue is that RAG assumes both that querying a data set (the “retrieval”) is a reliable way of delivering facts and that the LLM will accurately represent that data in its answer.

Neither is true. Data sets do not and cannot codify truth or facts in any way. Query results always require interpretation. Moreover, most data sets, even those as structured and curated as legal research, are inherently ambiguous and self-contradictory.
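
In code, the pattern being tested looks roughly like the sketch below. This is not any vendor’s actual pipeline – the TF-IDF retriever and the `llm_generate()` helper are hypothetical stand-ins – but it shows where both assumptions enter.

```python
# A minimal sketch of the retrieve-then-generate pattern behind RAG tools.
# llm_generate() is a hypothetical placeholder, not a real vendor API.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Assumption 1: text similarity to the query is a reliable proxy for
    relevance, and the top-ranked documents actually settle the question."""
    vectors = TfidfVectorizer().fit_transform([query] + documents)
    scores = cosine_similarity(vectors[0], vectors[1:])[0]
    ranked = sorted(zip(scores, documents), reverse=True)
    return [doc for _, doc in ranked[:k]]


def llm_generate(prompt: str) -> str:
    """Assumption 2: the model faithfully represents the retrieved context in
    its answer instead of falling back on its training data. Placeholder only."""
    return "<model output goes here>"


def answer(query: str, documents: list[str]) -> str:
    context = "\n\n".join(retrieve(query, documents))
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    return llm_generate(prompt)
```

Everything downstream of `retrieve()` treats whatever comes back as authoritative, and everything downstream of `llm_generate()` trusts the model to have stuck to that context. The legal research study quoted below found problems at both steps.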

Legal queries, however, often do not admit a single, clear-cut answer (Mik, 2024). In a common law system, case law is created over time by judges writing opinions; this precedent then builds on precedent in the way that a chain novel might be written in seriatim (Dworkin, 1986). By construction, these legal opinions are not atomic facts; indeed, on some views, the law is an “essentially contested” concept (Waldron, 2002). Thus, deciding what to retrieve can be challenging in a legal setting. At best, a RAG system must be able to locate information from multiple sources across time and place in order to properly answer a query. And at worst, there may be no set of available documents that definitively answers the query, if the question presented is novel or indeterminate.

This problem isn’t limited to law. The conclusiveness issue – “there may be no set of available documents that definitively answers the query” – is arguably a much greater problem in most other fields such as history, literature, or politics.

Query results always require interpretation and LLMs are not capable of that kind of self-aware interpretation.

Second, document relevance in the legal context is not based on text alone. Most retrieval systems identify relevant documents based on some kind of text similarity (Karpukhin et al., 2020). But the retrieval of documents that only seem textually relevant—and are ultimately irrelevant, or “distracting”—negatively affects performance on general question-answering tasks (Cuconasu et al., 2024; Chen et al., 2024). Problems of this type are likely to compound in the legal domain. In different jurisdictions and in different time periods, the applicable rule or the relevant jurisprudence may differ. Even similar-sounding text in the correct time and place may not apply if special conditions are not met. The problem may be worse if a rule that applies in a special condition conflicts with a more broadly applicable rule. The LLM may have been trained on a much greater volume of text supporting the broadly applicable rule, and may be more faithful to its training data than to the retrieval context. Consequently, designing a high-quality research tool that deals with this problem requires careful attention to non-textual elements of retrieval and the deference of the model to different sources of information.

Again, context being vital to the interpretation of the documents is not unique to law. Arguably, the fields where context isn’t vital are few and far between. The general context of the training data set and the opaque, hard-to-discover context of the retrieval query will always be in conflict.

Third, the generation of meaningful legal text is also far from straightforward. Legal documents are generally written for other lawyers immersed in the same issue, and they rely on an immense amount of background knowledge to properly understand and apply. A helpful generative legal research tool would have to do far more than simple document summarization; it would need to synthesize facts and rules from different pieces of text while keeping the appropriate legal context in mind

Every field has specialised language. LLMs can generate facsimiles of the language of an expert domain, but that replication only captures surface elements. Getting them to reliably deliver an internally and externally cohesive replication of expert text is a much harder problem, one that vendors haven’t been able to solve so far.

Law has one of the more codified, consistent, and regular domain variations of the English language available. It’s more consistent than what you get in history, philosophy, or even computer science (the English, not the programming languages themselves). It also has a much better-managed data set – legal records and decisions – than most other fields.

Ask yourself, does your organisation have its documentation and internal records in consistent and reliable order?

Because, if it doesn’t, no tool in existence will magically give you reliable and correct answers.

Teen Vogue is on the case

Teen Vogue has long been one of the most accurate and clearly written news outlets published in the US. They are genuinely a great publication.

Their recent write-up on “deepfake” porn is accurate, explains the phenomenon in an easily digestible way, and talks about what can be done about it.

The moment we heard that fake images of Taylor Swift were being passed around online, we knew what had happened. Swift, like many women and teens, was a target of “deepfake porn,” the massively harmful practice of creating nonconsensual fake sexualized images. As women working in AI, we’ve all experienced inappropriate sexualization and know first-hand how tech companies can do a better job at protecting women and teens. So let’s talk about what “deepfakes” are and what can be done to stop their proliferation.

How to Stop Deepfake Porn Using AI

I saw many dismiss the article when it first did the rounds last week because it’s written by employees of the “AI” company Hugging Face.

This is a mistake, for a number of reasons.

  1. You should always pay attention to critique that comes from inside a field. It is more likely to be taken seriously by people in that field.
  2. They are much more likely to be basing their critique on a genuinely deep understanding of the technology and how it works.
  3. Margaret Mitchell and Sasha Luccioni specifically have been doing amazing work in uncovering the issues inherent with the current generation of generative models. Everything they write is worth taking seriously.

That Teen Vogue published this piece demonstrates, to my mind, that generative models are causing a number of serious problems, many of which already seem to be disproportionately affecting those more vulnerable in society, such as teenagers.

It feels like people are more receptive in 2024 than they were last year

I spent much of 2023 writing about the many issues with generative models, their poor utility and reliability, and the many myths that had already arisen around them. At the time it felt like my words were consistently falling on deaf ears.

Unfortunately, the “AI” scene hasn’t fixed any of the problems I highlighted at the time.

Fortunately, that means many of the essays are still valid and useful in case you need to dig into some of the details of why this fad isn’t panning out.

You can find most of them on the “AI” category page on this site and in the book I wrote last year, though if you read the public essays the book honestly won’t tell you anything you don’t already know.

The most personal of these were probably the following two essays, both extracts from the book:

I really hope that we’re heading into the latter half of this particular tech bubble, because I find it exhausting. But, in the meantime, there are a lot of people doing good work in highlighting the many problems it’s causing.

You can also find me on Mastodon and Bluesky