Technical Discussion

Recursion Is The Next Scaling Law In AI

Y Combinator · 37m 53s

YC visiting partner Francois Shaard discusses two 2025 AI papers—Hierarchical Reasoning Models (HRM) and Tiny Recursive Models (TRM)—that demonstrate recursion at inference time as a powerful alternative to simply scaling model size. A 7-million parameter TRM outperforms much larger LLMs on reasoning benchmarks like ARC Prize by leveraging recursive hidden states instead of chain-of-thought token generation. The conversation contrasts these approaches with traditional LLMs and RNNs, exploring why recursion addresses fundamental limitations in transformer reasoning.

Summary

The episode begins by establishing foundational context around RNNs and their historical limitations. Francois explains that RNNs, which were considered essential for AGI around 2016 (notably with Alex Graves' adaptive compute time work), suffered from 'backprop through time' problems—vanishing/exploding gradients and memory requirements that scaled with sequence length. Transformers solved these training-time issues by processing all tokens in parallel using causal masking, but traded away latent compression and inherent recursive reasoning.

The hosts explain a fundamental limitation of LLMs: they are theoretically bounded by their number of layers for certain algorithmic tasks. Using sorting as an example, a transformer with 30 layers cannot perform comparison sort on a 31-element list because there are provably insufficient computational steps. This extends to 'incompressible' problems like Sudoku and mazes that cannot be solved in a single feed-forward pass. Chain-of-thought reasoning provides a workaround by recursing through token output space, but this is bounded by human knowledge in training data and forces reasoning through discrete token space rather than continuous latent space.
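The layer-bound claim rests on a standard counting argument; a sketch follows (the correspondence between transformer depth and sequential computation steps is the hosts' framing, not a formal theorem about transformers):

```latex
% Any comparison sort must distinguish all n! input orderings, and each
% comparison at most halves the set of remaining candidates, so:
\#\text{comparisons} \;\ge\; \lceil \log_2 n! \rceil \;=\; n \log_2 n - O(n).
% A transformer of fixed depth L performs at most L sequential dependent
% steps per forward pass, so for any fixed L there exists an n whose
% \Omega(n \log n) comparison requirement exceeds that sequential budget.
```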

The HRM paper (from Sapient Intelligence) is then detailed. Inspired loosely by the brain's hierarchical frequency processing, it uses three levels of recursion: a low-level module run T_L times, a high-level module run T_H times, and an outer refinement loop run N times. The key innovation over prior RNN work (like Alex Graves') is a truncated-backprop approach inspired by Deep Equilibrium (DEQ) learning: rather than backpropping through all recursion steps, they backprop through just the two modules once, stop gradients, and re-run with the same input but updated hidden states. This constructs an effective mini-batch across the memory/carry state space rather than across different inputs. A 27-million parameter HRM trained only on ~1,000 ARC Prize tasks achieved ~70% on ARC Prize 1, outperforming OpenAI's o3, which scored zero.
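The recursion scheme just described can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's code: the hidden width, loop counts, and the use of plain linear cells in place of the paper's transformer modules are all placeholder assumptions.

```python
import torch
import torch.nn as nn

D = 32  # hidden width (arbitrary for this sketch)
f_L = nn.Linear(3 * D, D)  # low-level update: (x, zL, zH) -> zL
f_H = nn.Linear(2 * D, D)  # high-level update: (zL, zH) -> zH

def hrm_step(x, zL, zH, T_L=2, T_H=2):
    """One (T_H, T_L) recursion segment; fully differentiable."""
    for _ in range(T_H):
        for _ in range(T_L):
            zL = torch.tanh(f_L(torch.cat([x, zL, zH], dim=-1)))
        zH = torch.tanh(f_H(torch.cat([zL, zH], dim=-1)))
    return zL, zH

def hrm_forward(x, N=4):
    zL = torch.zeros(x.shape[0], D)
    zH = torch.zeros(x.shape[0], D)
    # The first N-1 outer iterations evolve the carry under no_grad:
    # no backprop-through-time graph is built for them.
    with torch.no_grad():
        for _ in range(N - 1):
            zL, zH = hrm_step(x, zL, zH)
    # Gradients flow only through the final segment (truncated backprop).
    return hrm_step(x, zL, zH)

x = torch.randn(8, D)
zL, zH = hrm_forward(x)
```

Note that the memory cost is constant in N: only the last segment's activations are retained for the backward pass, which is exactly the property that backprop through time lacked.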

The TRM paper by Alexia then simplifies and improves on HRM in several ways: it collapses the separate low-level and high-level networks into a single weight-shared network, reduces to a single transformer layer, and backprops through one full recursive loop rather than just through the two modules. This truncated backprop to t=1 is shown to be sufficient, which is counterintuitive. The result is a 7-million parameter model achieving 87% on ARC Prize 1, better than the 27M HRM. Alexia also demonstrates that the outer refinement loop is the primary driver of performance, and that the DEQ fixed-point math justification from HRM doesn't fully hold, but the approach works empirically anyway.
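The TRM simplification can be sketched the same way. Again this is illustrative only: the width, loop counts, the linear cell standing in for the paper's single transformer layer, and the zeroing of x in the answer update (to mimic an f(y, z)-style call with one shared input signature) are assumptions of this sketch.

```python
import torch
import torch.nn as nn

D = 32
net = nn.Linear(3 * D, D)  # one weight-shared network for both updates

def trm_loop(x, y, z, T=6):
    """One full recursive loop: T latent updates, then one answer update."""
    for _ in range(T):
        z = torch.tanh(net(torch.cat([x, y, z], dim=-1)))
    # The answer embedding y is refined with the same shared weights;
    # x is zeroed here as a stand-in for the paper's f(y, z) update.
    y = torch.tanh(net(torch.cat([torch.zeros_like(x), y, z], dim=-1)))
    return y, z

def trm_forward(x, y, z, n_outer=3):
    # Earlier outer refinements evolve the carry without building a graph...
    with torch.no_grad():
        for _ in range(n_outer - 1):
            y, z = trm_loop(x, y, z)
    # ...then gradients flow through one complete loop (truncation at t=1),
    # deeper than HRM's two-module backprop but still constant in n_outer.
    return trm_loop(x, y, z)

x = torch.randn(8, D)
y, z = trm_forward(x, torch.zeros(8, D), torch.zeros(8, D))
```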

The code walkthrough shows both models share the same basic structure: initialize ZH and ZL as zeros, embed input X, run the nested recursion loops with no-grad on the carry states from previous iterations, then backprop through a small final portion of the loop. The key training insight is that not resetting hidden states between gradient steps means each step is effectively a different 'batch' in latent space.
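The training pattern in the walkthrough can be sketched as follows. The cell, output head, loss, and optimizer settings are placeholders; the point is the control flow: the same input is presented repeatedly, and the carry is detached rather than reset between gradient steps, so each step starts from a different point in carry space.

```python
import torch
import torch.nn as nn

D, steps = 32, 16
f = nn.Linear(3 * D, D)   # stand-in recurrent cell
head = nn.Linear(D, D)    # stand-in output head
opt = torch.optim.SGD(list(f.parameters()) + list(head.parameters()), lr=1e-2)

x = torch.randn(4, D)      # one task, shown `steps` times
target = torch.randn(4, D)
zL = torch.zeros(4, D)     # carry states persist across gradient steps
zH = torch.zeros(4, D)

for step in range(steps):
    zL = torch.tanh(f(torch.cat([x, zL, zH], dim=-1)))
    zH = torch.tanh(f(torch.cat([x, zL, zH], dim=-1)))
    loss = nn.functional.mse_loss(head(zH), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Stop gradients but keep the values: the next step resumes from the
    # evolved carry, not from zeros -- a mini-batch over carry space.
    zH, zL = zH.detach(), zL.detach()
```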

The broader implications discussed include: recursion is a durable trend in AI (Google's recurrent language models cited); truncated backprop at t=1 is a surprisingly powerful and underexplored idea; and the most exciting frontier is combining these tiny recursive models with large LLMs that have rich embedding spaces. The hypothesis is that LLMs excel at finding good latent representations but reason poorly within that space (always routing through token space), while TRMs can reason efficiently within a latent space. Combining both—using LLM embeddings as the substrate for TRM-style recursive reasoning—could yield dramatically more capable systems.

Key Insights

  • Francois argues that LLMs have a provable computational lower bound limitation: a transformer with N layers cannot perform comparison sort on a list longer than N elements, because there are insufficient sequential computational steps to satisfy the Ω(n·log n) comparison lower bound for comparison-based sorting.
  • The HRM paper's key innovation over Alex Graves' prior RNN work is using truncated backprop (stopping gradients at a single step rather than through all recursion steps), which avoids vanishing gradient problems while still enabling recursive reasoning via carry/hidden states.
  • Francois explains that not resetting the hidden states ZH and ZL between gradient update steps effectively constructs a mini-batch across latent memory space rather than across different inputs: the same input X is presented 16 times, but the model is in a different part of carry space each time.
  • The TRM paper by Alexia demonstrates that the DEQ (Deep Equilibrium) fixed-point math justification used in HRM doesn't actually hold—the deltas in ZL and ZH don't converge to zero—yet the approach works empirically, and full backprop through one complete recursion loop actually improves performance further.
  • A 7-million parameter TRM trained only on ARC Prize tasks achieves 87% on ARC Prize 1, outperforming both the 27-million parameter HRM (~70%) and OpenAI's o3 (0%), demonstrating that recursive architecture can substitute for massive parameter counts on reasoning tasks.
  • Francois argues that chain-of-thought reasoning is bounded by human knowledge in training data and forces reasoning through discrete token space, whereas RNN/TRM-style recursion operates in continuous latent space which is far more expressive—but historically couldn't be trained due to backprop through time constraints.
  • The outer refinement loop (N iterations during training) is identified as the primary driver of HRM/TRM performance, not the inner hierarchical recursion levels: Alexia's ablation study showed that training with 16 outer steps but testing with only 1 still recovers most of the performance.
  • Francois proposes that the most promising future direction is combining LLMs (which excel at building rich semantic embedding spaces through next-token prediction) with tiny recursive models that can reason within those latent spaces—bypassing the token-space bottleneck that currently limits LLM reasoning.

Topics

  • Hierarchical Reasoning Models (HRM)
  • Tiny Recursive Models (TRM)
  • Backpropagation through time and its limitations
  • Deep Equilibrium learning and truncated backprop
  • Recursion vs. chain-of-thought reasoning
  • ARC Prize benchmarks
  • Incompressible reasoning problems (Sudoku, mazes)
  • Combining LLM embeddings with recursive reasoning

Full transcript available for MurmurCast members
