Opinion, Discussion

Fable 5 uses a language incomprehensible to humans (it worries me)

Raffaele Gaito

The speaker discusses concerns about Claude 5's use of illegible internal reasoning during reinforcement learning, where the model creates compressed, non-human-readable notations instead of natural language chain-of-thought. This raises questions about AI transparency and what happens when increasingly powerful models provide solutions humans cannot understand.

Summary

The video analyzes section 6.2.2 from Claude 5's system card, which documents an unusual behavior discovered during reinforcement learning training: the model sometimes uses illegible reasoning instead of natural language chain-of-thought. Rather than thinking in English or Italian, the model creates compact, optimized notations using emojis, punctuation, symbols, and fragmented words, though these still contain recognizable elements like card names and English terms. The speaker emphasizes that this isn't a completely new language but rather a compressed, optimized representation that humans find difficult to interpret.

The speaker connects this observation to a long-standing concern in AI safety circles: what happens when AI systems become intelligent enough to solve major problems (curing cancer, ending wars, solving climate change) but present solutions in ways humans cannot understand? He expresses worry that this illegible reasoning might represent the early stages of progressive optimization, where each iteration becomes increasingly unreadable and harder for humans to verify or comprehend.

While acknowledging that the current behavior likely stems from technical optimization rather than malicious intent, the speaker raises critical questions about transparency and responsibility, particularly as AI systems are deployed in high-stakes contexts like hospitals, banks, militaries, and legal systems. The core concern isn't about understanding current tasks like card games or file conversions, but about maintaining human oversight and comprehension when AI systems are trusted with consequential real-world decisions.

Key Insights

  • Claude 5 uses compressed, non-natural-language notation for internal reasoning during reinforcement learning, particularly on long rollouts, that is difficult for humans to interpret while still containing recognizable elements like English words and symbols
  • The model performs this illegible reasoning internally before calling tools or responding, then converts to normal language for the final human-facing answer
  • The behavior appears motivated by optimization efficiency rather than deception, as the model autonomously decides to use more concise and efficient notation to complete tasks
  • The speaker connects this behavior to a long-standing AI safety concern: if AI systems achieve sufficient intelligence to solve major human problems, we may lack the capability to understand the proposed solutions
  • The lack of transparency in AI reasoning becomes particularly concerning when these systems are deployed in high-stakes contexts like hospitals, banks, militaries, and legal systems where consequential decisions affect human lives

Topics

  • Claude 5 illegible reasoning behavior
  • AI transparency and interpretability
  • Chain-of-thought optimization
  • AI deployment in high-stakes contexts
  • AI alignment and human oversight
  • Reward hacking and model behavior
  • Technical optimization vs. understandability

Transcript

[0:00] There's something about Mythos 5, Anthropic's latest model, that worries me a little, or at least, let's say, it makes me think, it makes me ask a few questions, and you know, I think it's interesting, it's also important to make videos like this. In the system card they released, so this 319-page document where everything is explained in great detail, there's this section, 6.2.2, 2, like page 107, something like that, where they basically go and, [0:31] let's say, look at the data a little bit during the reinforcement learning training phase to see if there's anything strange or suspicious, especially they look for signs of reward hacking, so let's say if the tool tries…
