# Hierarchical Reasoning Model (HRM)
Date: 2025
URL: [\[2506.21734\] Hierarchical Reasoning Model](https://arxiv.org/abs/2506.21734)
Log:
- 2026-01-12: ARC people did ablations and analysis, arguing the major gains come just from plain old recurrence: [The Hidden Drivers of HRM's Performance on ARC-AGI](https://arcprize.org/blog/hrm-analysis). Refining your predictions works, who knew!
- 2026-01-12: [[Tiny Reasoning Model (TRM)]] refines and simplifies this further without having to rely on fishy biological inspirations, while achieving significant scores on the same benchmarks!
---
> The human brain provides a compelling blueprint for achieving the effective computational depth that contemporary artificial models lack. It organizes computation hierarchically across cortical regions operating at different timescales, enabling deep, multi-stage reasoning. Recurrent feedback loops iteratively refine internal representations, allowing slow, higher-level areas to guide, and fast, lower-level circuits to execute—subordinate processing while preserving global coherence. Notably, the brain achieves such depth without incurring the prohibitive credit-assignment costs that typically hamper recurrent networks from backpropagation through time.
They cite the brain as inspiration for a few key components: a hierarchy of processing happening at different timescales, iterative refinement of representations, bypassing [[Backpropagation Through Time (BPTT)]] with a single update (no unrolling), and dynamic compute allocation through a Q-learning mechanism. Notably, there is no ablation study.
Four learnable networks:
- input network
- low-level recurrent module (L-module) that updates at each of the T timesteps within a cycle and maintains a hidden state
- high-level recurrent module (H-module) that updates once per "high-level cycle", over N cycles, and also maintains a hidden state
- output network
> At each timestep i, the L-module updates its state conditioned on its own previous state, the H- module’s current state (which remains fixed throughout the cycle), and the input representation. The H-module only updates once per cycle (i.e., every T timesteps) using the L-module’s final state at the end of that cycle:
> Finally, after N full cycles, a prediction $\hat{y}$ is extracted from the hidden state of the H-module:
They call the entire NT-timestep process a single forward pass. This iterative refinement process, which looks a lot like [[Expectation Maximization]], is what they term hierarchical convergence.
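Rough sketch of the forward dynamics as I read them, with GRU cells standing in for the paper's Transformer blocks and my own names for the four networks (`f_in`, `f_L`, `f_H`, `f_out`):

```python
import torch
import torch.nn as nn

class HRM(nn.Module):
    """Minimal sketch of the HRM forward dynamics, not the paper's exact architecture:
    GRU cells stand in for the Transformer blocks used in the paper."""

    def __init__(self, d_model: int, vocab_size: int, N: int = 2, T: int = 4):
        super().__init__()
        self.N, self.T = N, T                            # N high-level cycles, T low-level steps per cycle
        self.f_in = nn.Embedding(vocab_size, d_model)    # input network
        self.f_L = nn.GRUCell(2 * d_model, d_model)      # L-module
        self.f_H = nn.GRUCell(d_model, d_model)          # H-module
        self.f_out = nn.Linear(d_model, vocab_size)      # output network

    def forward(self, x_tokens, z_L, z_H):
        x = self.f_in(x_tokens)                          # input representation, fixed for the whole pass
        for _ in range(self.N):                          # N high-level cycles
            for _ in range(self.T):                      # T low-level timesteps per cycle
                # L-state update: own previous state + frozen H-state + input
                z_L = self.f_L(torch.cat([x, z_H], dim=-1), z_L)
            z_H = self.f_H(z_L, z_H)                     # H-module updates once per cycle, from the final L-state
        return self.f_out(z_H), z_L, z_H                 # prediction read out from the H-state
```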
They also get rid of BPTT, arguing that if an RNN converges to a fixed point, unrolling its state sequence is not necessary, and citing brain research that "cortical credit assignment relies on local mechanisms". Seems to be based on [Deep Equilibrium Models](https://proceedings.neurips.cc/paper/2019/hash/01386bd6d8e091c2ab4c7c7de644d37b-Abstract.html).
> we propose a one-step approximation of the HRM gradient–using the gradient of the last state of each module and treating other states as constant. The gradient path is, therefore,
>
> Output head → final state of the H-module → final state of the L-module → input embedding
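In code, the one-step approximation just means running everything except the last L-step and last H-update under `no_grad`, so the backward path is exactly the one above. Sketch building on the `HRM` stand-in from before (not the paper's implementation):

```python
def forward_one_step_grad(model, x_tokens, z_L, z_H):
    """One-step gradient approximation: only the final update of each module stays in
    the autograd graph; every earlier state is treated as a constant."""
    x = model.f_in(x_tokens)
    with torch.no_grad():                                # no graph for all but the last updates
        for n in range(model.N):
            for t in range(model.T):
                if n == model.N - 1 and t == model.T - 1:
                    break                                # save the very last L-step for the tracked part
                z_L = model.f_L(torch.cat([x, z_H], dim=-1), z_L)
            if n < model.N - 1:
                z_H = model.f_H(z_L, z_H)
    # gradient path: output head -> final H-state -> final L-state -> input embedding
    z_L = model.f_L(torch.cat([x, z_H], dim=-1), z_L)
    z_H = model.f_H(z_L, z_H)
    return model.f_out(z_H), z_L, z_H
```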
Now they introduce their [[Deep Supervision with Recursion|deep supervision]], which runs multiple forward passes ("segments"), but crucially they detach the hidden state z between segments before the parameter update. They say this provides more frequent feedback to the H-module and serves as a regularization mechanism, but it's probably mostly about superior empirical performance.
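Something like this, assuming the one-step forward from above; the detach between segments is the important bit:

```python
def train_segment_loop(model, optimizer, x_tokens, y, n_segments, loss_fn):
    """Deep supervision sketch: several forward "segments" per example, a parameter
    update after each one, and the carried state detached between segments."""
    z_L = torch.zeros(x_tokens.shape[0], model.f_L.hidden_size)
    z_H = torch.zeros(x_tokens.shape[0], model.f_H.hidden_size)
    for _ in range(n_segments):
        y_hat, z_L, z_H = forward_one_step_grad(model, x_tokens, z_L, z_H)
        loss = loss_fn(y_hat, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # the crucial detach: no gradients flow across segment boundaries
        z_L, z_H = z_L.detach(), z_H.detach()
```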
The [[Temporal Difference Learning#Q-Learning|Q-learning]] mechanism takes the final state of the H-module and predicts Q values for the "halt" and "continue" actions through a linear layer. Whenever the number of supervision segments exceeds the maximum threshold, or when the halt Q value exceeds the continue one, the loop stops. Not sure if this is strictly RL or just plain supervised learning.
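Roughly (the names `q_head` and `m_max` are mine, and halting the whole batch at once is a simplification of whatever per-example bookkeeping they actually do):

```python
import torch
import torch.nn as nn

d_model = 256                              # hidden size, placeholder value
q_head = nn.Linear(d_model, 2)             # predicts [Q_halt, Q_continue] from the final H-state

def should_halt(z_H: torch.Tensor, segment_idx: int, m_max: int = 8) -> bool:
    q = q_head(z_H)                                      # shape (batch, 2): [Q_halt, Q_continue]
    hit_cap = segment_idx + 1 >= m_max                   # maximum number of segments reached
    prefers_halt = bool((q[:, 0] > q[:, 1]).all())       # whole batch prefers halting (a simplification)
    return hit_cap or prefers_halt
```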
The final loss is a sequence-to-sequence loss with the typical softmax output, combined with a BCE loss for training the Q values:
$L_{\mathrm{ACT}}^m=\operatorname{Loss}\left(\hat{y}^m, y\right)+\operatorname{BinaryCrossEntropy}\left(\hat{Q}^m, \hat{G}^m\right)$.
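In PyTorch terms, something like this (with `g_hat` standing in for the bootstrapped Q targets $\hat{G}^m$):

```python
import torch.nn.functional as F

def act_loss(y_hat, y, q_hat, g_hat):
    """Combined loss sketch: sequence cross-entropy on the predictions plus BCE between
    the predicted Q values (as logits) and their targets."""
    seq_loss = F.cross_entropy(y_hat.view(-1, y_hat.size(-1)), y.view(-1))
    q_loss = F.binary_cross_entropy_with_logits(q_hat, g_hat.float())
    return seq_loss + q_loss
```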
![[hierarchical-reasoning-model 1.png]]