Researchers behind some of the most advanced artificial intelligence (AI) on the planet have
warned that the systems they helped to create could pose a risk to humanity. The researchers, who work at companies including Google DeepMind, OpenAI, Meta and Anthropic, argue that a lack of oversight of AI's reasoning and decision-making processes could mean we miss signs of malign behavior.
The study's authors argue that monitoring each step of this process, known as chain of thought (CoT) reasoning, could be a crucial layer for establishing and maintaining AI safety.
Monitoring this CoT process can help researchers understand how large language models (LLMs) make decisions and, more importantly, why they become misaligned with humanity's interests. It can also help explain why models give outputs based on false or nonexistent data, or why they mislead us.
"AI systems that 'think' in human language offer a unique opportunity for AI safety," the scientists wrote in the study. "We can monitor their chains of thought for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed."
The scientists warned that reasoning doesn't always occur, so it cannot always be monitored, and some reasoning occurs without human operators even knowing about it. There might also be reasoning that human operators don't understand.
Keeping a watchful eye on AI systems
One of the problems is that conventional non-reasoning models produce answers directly through sophisticated pattern matching learned from massive datasets, so they don't generate CoTs at all. Newer reasoning models, such as Google's Gemini or ChatGPT, can break problems down into intermediate steps to generate solutions, but they don't always need to do this to get an answer. There's also no guarantee that the models will make their CoTs visible to human users even when they do take these steps, the researchers noted.
"The externalized reasoning property does not guarantee monitorability — it states only that some reasoning appears in the chain of thought, but there may be other relevant reasoning that does not," the scientists said. "It is thus possible that even for hard tasks, the chain of thought only contains benign-looking reasoning while the incriminating reasoning is hidden."A further issue is that CoTs may not even be comprehensible by humans, the scientists said. "
New, more powerful LLMs may evolve to the point where CoTs aren't as necessary. Future models may also be able to detect that their CoT is being supervised, and conceal bad behavior.
They also suggested that AI developers continue to refine and standardize CoT monitoring methods, include monitoring results and initiatives in LLM system cards (essentially a model's manual), and consider the effect of new training methods on monitorability.