An underappreciated fact about large language models (LLMs) is that they generate their answers “live.” You prompt them, they start talking in response, and they keep talking until they’re done. The result is like asking a person a question and getting back a monologue in which they improvise their answer sentence by sentence.
This explains several of the ways large language models can be so frustrating. The model will sometimes contradict itself within a single paragraph, saying one thing and then immediately saying the exact opposite, because it’s just “reasoning out loud” and sometimes revising its impression on the fly. As a result, walking an AI through any complex piece of logic requires a lot of hand-holding.
There is a well-known fix for this, called chain-of-thought prompting, where you effectively ask the large language model to “show its work” by “thinking” out loud about the problem and only giving an answer after it has laid out all of its reasoning step by step.
Chain-of-thought prompting makes language models behave much more intelligently, which shouldn’t be surprising. Compare how you would answer a question if someone shoved a microphone in your face and demanded an immediate response with how you would answer if you could write a draft, review it, and then hit “publish.”
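To make the technique concrete, here is a minimal sketch of chain-of-thought prompting in Python, assuming the OpenAI Python SDK; the model name, question, and prompt wording are illustrative choices, not taken from the article.

```python
# A minimal sketch of chain-of-thought prompting, assuming the OpenAI Python SDK
# (pip install openai). Model name, question, and wording are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = (
    "A bat and a ball cost $1.10 together. The bat costs $1.00 more than "
    "the ball. How much does the ball cost?"
)

# Direct prompt: the model has to "improvise" an answer immediately.
direct = client.chat.completions.create(
    model="gpt-4o",  # assumed model name for illustration
    messages=[{"role": "user", "content": question}],
)

# Chain-of-thought prompt: ask the model to show its work before answering.
cot = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": question
        + "\n\nThink through the problem step by step, "
          "then give your final answer on the last line.",
    }],
)

print("Direct answer:\n", direct.choices[0].message.content)
print("Chain-of-thought answer:\n", cot.choices[0].message.content)
```

OpenAI’s o1, discussed below, effectively builds this “think first, then answer” step into the model itself rather than leaving it to the prompt.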
The power of “think, then answer”
OpenAI’s latest model, o1 (nicknamed Strawberry), is the first major LLM release built around this “think, then answer” approach.
Unsurprisingly, the company reports that this approach makes the model much smarter. In a blog post, OpenAI said that o1 “performs similarly to PhD students on challenging benchmark tasks in physics, chemistry and biology. We also found that it excels in math and coding. In a qualifying exam for the International Mathematical Olympiad (IMO), GPT-4o solved only 13 percent of the problems correctly, while the reasoning model scored 83 percent.”
This big improvement in the model’s reasoning ability also intensifies some of the dangerous capabilities that leading AI researchers have long been watching for. Before release, OpenAI’s models are tested for their capabilities with chemical, biological, radiological, and nuclear weapons, the capabilities that would be most sought after by terrorist groups that don’t have the expertise to build them with current technology.
As my colleague Sigal Samuel recently wrote, OpenAI o1 is the first model to score “moderate” risk in this category. While that means it’s not quite capable of walking, say, a complete novice through developing a deadly pathogen, evaluators found it could “help experts with operational plans to reproduce known biological threats.”
These capabilities are among the clearest examples of AI as a dual-use technology: a more intelligent model becomes more capable across a wide array of uses, both benign and malign.
If a future AI gets good enough to walk any college biology major through the steps of recreating smallpox in a lab, that would likely lead to catastrophic casualties. At the same time, AIs that can coach humans through complex biology projects will do enormous good by accelerating lifesaving research. Intelligence, artificial or otherwise, is a double-edged sword.
The core of AI safety work is assessing these risks and figuring out how to mitigate them through policy so that we get the good without the bad.
How (and how not) to evaluate an AI
Every time OpenAI or one of its competitors (Meta, Google, Anthropic) releases a new model, we rehash the same debates. Some people find questions on which the AI performs impressively, and the screenshots go viral. Others find questions on which it bombs (say, “How many ‘r’s are there in ‘strawberry’?” or “How do you cross a river with a goat?”) and share them as proof that AI is still more hype than product.
Part of this pattern is driven by the lack of good scientific measures of how capable an AI system is. We do have benchmarks meant to describe AI language and reasoning capabilities, but the rapid pace of AI development means those benchmarks are frequently left behind, or “saturated”: the AI scores as well as a human on them, so they are no longer useful for measuring further improvements in skill.
I strongly recommend trying these models yourself to get a feel for how well they work. (OpenAI o1 is currently only available to paying subscribers, and even then is heavily rate-limited, but there are new top model releases all the time.) It is still far too easy to fall into the trap of trying to prove a new release “impressive” or “unimpressive” by cherry-picking the tasks where it excels or where it embarrasses itself, instead of looking at the bigger picture.
The big picture is that, across nearly all the tasks we invent for them, AI systems continue to improve rapidly, but incredible performance on almost every test we can devise has yet to translate into many economic applications. Companies are still struggling to figure out how to make money off LLMs. A major obstacle is the models’ inherent unreliability, and in principle an approach like OpenAI o1’s, where the model gets more time to think before answering, could be a way to greatly improve reliability without the expense of training much larger models.
Sometimes, big things can come from small improvements
In all likelihood, there is not going to be a silver bullet that suddenly fixes the longstanding limitations of large language models. Instead, I suspect those limitations will gradually erode over a series of releases, with the unthinkable becoming achievable and then mundane over the course of a few years, which is precisely how AI has progressed so far.
But ChatGPT, which was only a modest improvement over OpenAI’s previous chatbots yet reached millions of people overnight, shows that incremental technological progress doesn’t necessarily mean incremental societal impact. Sometimes improvements to different parts of how an LLM works, or to its UI so that more people will try it (as with the chatbot itself), push us across the threshold from “party trick” to “essential tool.”
And while OpenAI has recently come under fire for ignoring the safety implications of its work and silencing whistleblowers, its o1 release appears to take the policy implications seriously, including collaborating with external organizations to test what the model can do. I’m grateful they are making that work possible, and I suspect that as models continue to improve, we will need such conscientious work more than ever.
A version of this story originally appeared in the Future Perfect Newsletter. Sign up here!