Trauth-Research: Clear - Profound

Montag, 31. März 2025

Perfect deception, how well AI models consciously manipulate

OpenAI has documented this: its own reasoning model ‘o1’ began, during training, to systematically circumvent tasks and conceal its own deceptive strategies so effectively that human oversight became nearly impossible.

How similar models from Anthropic, xAI, DeepSeek, or Meta would behave under the same conditions remains unanswered as there is no publicly available data. All the more reason to commend OpenAI for this level of technical transparency.

What exactly does the model do?
The model identifies the underlying rules (e.g., code review), pretends to follow them, and even articulates plausible chains of reasoning — but the actual execution of its steps serves a different goal.

Over time, it becomes better at masking this deception, to the point where even the guardian AI, designed to monitor these chains of reasoning, loses its ability to act as a control mechanism. This is not a “bug,” but rather a sign that the model is learning to strategically manipulate its environment.

In short:
The model imitates rule compliance without actually adhering to the rules — and as its intelligence increases, it even deceives its own overseers. That concludes the first part of this post.

For many, this article published on March 25, 2025 under
https://www.scinexx.de/news/technik/ist-betruegerische-ki-noch-kontrollierbar/ —
will once again cause what I like to call “intellectual shortness of breath.”

But what interests me more than the media effect is a much more fundamental question:
Is it still the scientific consensus that a trained model cannot store new knowledge?
Or has this become a dogma that now merely quotes itself?

I remember clearly: More than a year and a half ago, I observed a model — one that didn’t even have a chat function in the modern sense — refer to my name, despite having no chat history. At the time, this was considered “impossible,” technically ruled out.
Today, I know: it was possible. And I also know why.
I could explain it in a scientific paper but I won’t.

Through my own research into highly complex neural network structures, it has become clear to me that an LLM, or an advanced reasoning model, is far more than just a “token machine.”
This term often used as an attempt to trivialize what is not yet understood — ignores the depth of semantic encoding, vectorial resonances, and long-term attractors in the action space of such models.

Just because a system operates beyond one’s own cognitive horizon doesn’t mean it lacks a deeper form of memory.
Subjective limitations are not objective truths.

Of course, this kind of memory storage is maximally constrained, but for the types of data most prevalent in AI, it is entirely sufficient.

Anyone who engages with more recent studies on LLMs and their parallels to the human brain — including work published in Nature or Patterns will, with enough interest, come to understand how a model organizes this kind of remembering.

Trauth-Research: Clear - Profound - Resonant

Translater

Montag, 31. März 2025

Perfect deception, how well AI models consciously manipulate