Binary humans, nuanced AI

For my whole life, the meme was that computers had to be literal, and that only humans could be subtle and nuanced. It is therefore somewhat ironic (or, to go back to the ancient Greek origin of the word "irony", Alazṓnic) that, now we have AI so good that we're increasingly unsure whether the words we see were written by human or machine, the question of AI safety is one where so many people show so much less nuance than even the existing large language models.

"Is AI safe?" isn't even a well-formed question when asked in isolation. Indeed, for all X, "[is/are] ${X} safe?" is not a well-formed question: "Are cars safe?" depends on the car, the driver, the roads themselves, and the other road users. "Are planes safe?" depends on the plane, the pilot, the weather, the maintenance, whether you're flying too close to a warship where either you or they have the wrong kind of IFF system, and whether you get hijacked. "Are chemicals safe?" depends on the chemical (water is one atom away from being bleach!), the temperature, how it's being handled, and how much there is.

And so it is with AI: "Is AI safe?" depends on the AI and on what it is connected to. A simple chess player on your phone, in isolation, won't end the world. But both the USSR and the USA had automated early warning systems that, during the Cold War, reported a first strike in progress due to the sun and the moon respectively, and only human intervention in each of these cases prevented these systems from ending the world.

But even if I reduce this from AI in general to just large language models like ChatGPT etc., it's still not clear. Why? Because, even after a year and a half in the public eye (and longer in the eyes of nerds like me), nobody really knows the limits of LLM capabilities.

At one level, they look safe: we can say they are just playing a game of 'what word comes next?'

But they're good enough at this game that people are often tempted to use them as a control mechanism for something else, such as robots, or just blindly use the results without bothering with human oversight even when they have liability for errors, like those two New York lawyers who got in the news (and in trouble) for letting ChatGPT do their job for them.

(As an aside: some people suggest that LLMs are harmful because they "allow students to cheat in exams", which I find as ridiculous as saying "calculators are harmful, you won't have one on you at all times when you're an adult" or "writing is harmful, you won't bother remembering things if you wrote them down" — whenever an automation is good enough to use to cheat on a test, the thing being tested has necessarily transitioned from 'something that we must teach our kids' to 'something you can learn for fun, if you like').

So, LLMs definitely have at least as much potential for danger as any other form of automation, simply because they are being used for general automation. Claiming that any LLM is "risk free" is a bit like saying "we can make bug-free software"… except even worse, as the current state of the art for LLMs means the claim really comes with the caveat "and we manage that even though we only hire interns".

There are further forms of harm besides this. All technology, without exception, is about enabling things that were previously too hard or too expensive. There are a lot of things out there which have already been invented, but where knowledge of the invention isn't widespread: ever since Word Lens was released at the end of 2010, I've been surprising and delighting people with the fact that our phones can give us live, augmented reality translations of whatever our phone camera can see. Even in 2024, I'm regularly meeting people who don't know this is built into Google Translate (these days, the tech is also built into the iOS photo app). But for dangerous things? The 9/11 attacks happened just as I was reaching adulthood, and my generation got to witness a bunch of really dumb attempts to cause harm. Imagine if the Shoe Bomber and his co-conspirators had been able to even just brainstorm ways their attack could go wrong. An LLM is "just" next-word-prediction, it "only" knows things it finds on the public internet, what can go wrong?

And even that is a simple case. AI can explain research papers and help write and debug code, so someone mediocre — not a total zero, but still a junior — can take existing work, and get help from an LLM to turn it into something much more impressive (after all, that's the point of automation: you don't need to be a master woodworker to use a CNC machine). How can an LLM be used this way to cause harm?

"By going as close as we dared, we have still crossed a grey moral boundary, demonstrating that designing virtual potential toxic molecules is possible without much effort, time or computational resources. We can easily erase the thousands of molecules we created, but we cannot delete the knowledge of how to recreate them."

LLMs make this kind of work easier to reproduce. And they still do, even with the current attempts at alignment, because the thing which scared Fabio Urbina et al. was, in layman's terms, flipping a switch between good and evil (such a flip can also happen accidentally, in more than one way). Downloadable weights can be easily modified, but even models kept locked behind an API can be manipulated to some extent. The API-only access for which OpenAI is getting criticised is not a silver bullet, but it is better than nothing… with the caveat that research into some of the many open questions on the subject of AI safety has been enabled precisely by the downloadable models which are so easy to manipulate.

But all of this is "small scale" danger and harm. Yes, I really am describing the potential for novel toxic chemicals as "small danger". Why? Because of how much worse it can get. When AI leads to some disaster, I expect the disaster to be self-limited by the scope of whatever the AI was being asked to do. The large-scale discrete incidents from failures of human cognition have been the Bhopal gas tragedy and the Chernobyl disaster. I expect more of that, not the setup for The Terminator or Universal Paperclips.

Now, you may be appalled by all these scenarios, and that's fair. By the standard of AI safety researchers, the fact that I think LLMs won't get outcomes worse than Bhopal and Chernobyl makes me an optimist; and given how easy it is to re-align an existing model, to find and flip its "evil" switch (not that whoever is doing this would describe their own actions that way, as we're all the heroes of our own stories), I do foresee people downloading models, making these changes, using them to control industrial hardware… and that hardware literally exploding in their own face.

The more severe position is that we will get AI with agency, which plots to take over the world for whatever reason. This is certainly a thing to keep an eye out for (it is very easy to give an LLM agency), but right now those LLMs run into situations they can't cope with. And even if they could cope with anything, right now humanity collectively can out-think AI in terms of diversity of thought (if you've met one LLM you've met them all, even though they're good at playing many roles), learning from few examples (LLMs make up for needing loads of examples by reading basically everything very quickly), and gross throughput (there's a lot of humans, while computer hardware is currently a limiting factor).

All of those "defences" are temporary or partial. Evolution designed our brain to be both power-efficient and to get a long way from a few examples, and evolutionary algorithms are very easy to do in a computer. And we're building more hardware as fast as possible. And they're only partial, because there's a non-zero risk of an AI only needing to be really good at one thing to cause problems, and also a non-zero risk of us being explicitly told about a long-term danger and always picking the short-term reward (say, a re-run of the last century of CO2 emissions, which we did knowingly because we wanted what the power stations could give us more than we cared about the ice caps melting).

I think the challenges in these areas buy us 5-15 years to answer a question that has haunted our species for at least as long as we've had writing, and probably longer:

What does "good" mean?

Now, if only I could convince people it wasn't obvious when they didn't already believe that. So much online discussion is P(doom) = 0 or P(doom) = 1.

© Ben Wheatley — Licence: Attribution-NonCommercial-NoDerivs 4.0 International