# Hierarchical Reasoning Model (HRM)
Date: 2025
URL: [\[2506.21734\] Hierarchical Reasoning Model](https://arxiv.org/abs/2506.21734)
Log:
- 2026-01-12: ARC people did ablations and analysis, arguing the major gains come just from plain old recurrence: [The Hidden Drivers of HRM's Performance on ARC-AGI](https://arcprize.org/blog/hrm-analysis). Refining your predictions works, who knew!
- 2026-01-12: [[Tiny Reasoning Model (TRM)]] refines and simplifies this further without having to rely on fishy biological inspirations, while achieving significant scores on the same benchmarks!
---
> The human brain provides a compelling blueprint for achieving the effective computational depth that contemporary artificial models lack. It organizes computation hierarchically across cortical regions operating at different timescales, enabling deep, multi-stage reasoning. Recurrent feedback loops iteratively refine internal representations, allowing slow, higher-level areas to guide, and fast, lower-level circuits to execute—subordinate processing while preserving global coherence. Notably, the brain achieves such depth without incurring the prohibitive credit-assignment costs that typically hamper recurrent networks from backpropagation through time.
They cite the brain as inspiration for a few key components: a hierarchy of processing happening at different timescales, iterative refinement of representations, bypassing [[Backpropagation Through Time (BPTT)]] with a single update (no unrolling), and dynamic compute allocation through a Q-learning mechanism. Notably, there is no ablation study.
Four learnable networks:
- input network
- low-level recurrent module (L-module) that updates at each of the T timesteps within a cycle and maintains a hidden state
- high-level recurrent module (H-module) that updates once per "high-level cycle", over N cycles, and also maintains a hidden state
- output network
> At each timestep i, the L-module updates its state conditioned on its own previous state, the H- module’s current state (which remains fixed throughout the cycle), and the input representation. The H-module only updates once per cycle (i.e., every T timesteps) using the L-module’s final state at the end of that cycle:
> Finally, after N full cycles, a prediction $\hat{y}$ is extracted from the hidden state of the H-module:
They call the entire NT-timestep process a single forward pass. This iterative refinement process, which looks a lot like [[Expectation Maximization]], is what they term hierarchical convergence.
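Rough sketch of the forward dynamics as I read them, with GRU cells standing in for the paper's Transformer blocks and my own names for the four networks (`f_in`, `f_L`, `f_H`, `f_out`):

```python
import torch
import torch.nn as nn

class HRM(nn.Module):
    """Minimal sketch of the HRM forward dynamics, not the paper's exact architecture:
    GRU cells stand in for the Transformer blocks used in the paper."""

    def __init__(self, d_model: int, vocab_size: int, N: int = 2, T: int = 4):
        super().__init__()
        self.N, self.T = N, T                            # N high-level cycles, T low-level steps per cycle
        self.f_in = nn.Embedding(vocab_size, d_model)    # input network
        self.f_L = nn.GRUCell(2 * d_model, d_model)      # L-module
        self.f_H = nn.GRUCell(d_model, d_model)          # H-module
        self.f_out = nn.Linear(d_model, vocab_size)      # output network

    def forward(self, x_tokens, z_L, z_H):
        x = self.f_in(x_tokens)                          # input representation, fixed for the whole pass
        for _ in range(self.N):                          # N high-level cycles
            for _ in range(self.T):                      # T low-level timesteps per cycle
                # L-state update: own previous state + frozen H-state + input
                z_L = self.f_L(torch.cat([x, z_H], dim=-1), z_L)
            z_H = self.f_H(z_L, z_H)                     # H-module updates once per cycle, from the final L-state
        return self.f_out(z_H), z_L, z_H                 # prediction read out from the H-state
```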
They also get rid of BPTT, arguing that if an RNN converges to a fixed point, unrolling its state sequence is not necessary, and citing brain research that "cortical credit assignment relies on local mechanisms". Seems to be based on [Deep Equilibrium Models](https://proceedings.neurips.cc/paper/2019/hash/01386bd6d8e091c2ab4c7c7de644d37b-Abstract.html).
> we propose a one-step approximation of the HRM gradient–using the gradient of the last state of each module and treating other states as constant. The gradient path is, therefore,
>
> Output head → final state of the H-module → final state of the L-module → input embedding
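In code, the one-step approximation just means running everything except the last L-step and last H-update under `no_grad`, so the backward path is exactly the one above. Sketch building on the `HRM` stand-in from before (not the paper's implementation):

```python
def forward_one_step_grad(model, x_tokens, z_L, z_H):
    """One-step gradient approximation: only the final update of each module stays in
    the autograd graph; every earlier state is treated as a constant."""
    x = model.f_in(x_tokens)
    with torch.no_grad():                                # no graph for all but the last updates
        for n in range(model.N):
            for t in range(model.T):
                if n == model.N - 1 and t == model.T - 1:
                    break                                # save the very last L-step for the tracked part
                z_L = model.f_L(torch.cat([x, z_H], dim=-1), z_L)
            if n < model.N - 1:
                z_H = model.f_H(z_L, z_H)
    # gradient path: output head -> final H-state -> final L-state -> input embedding
    z_L = model.f_L(torch.cat([x, z_H], dim=-1), z_L)
    z_H = model.f_H(z_L, z_H)
    return model.f_out(z_H), z_L, z_H
```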
Now they introduce their [[Deep Supervision with Recursion|deep supervision]], which runs multiple forward passes ("segments"), but crucially they detach the hidden state z between segments before the parameter update. They say this provides more frequent feedback to the H-module and serves as a regularization mechanism, but it's probably mostly about superior empirical performance.
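Something like this, assuming the one-step forward from above; the detach between segments is the important bit:

```python
def train_segment_loop(model, optimizer, x_tokens, y, n_segments, loss_fn):
    """Deep supervision sketch: several forward "segments" per example, a parameter
    update after each one, and the carried state detached between segments."""
    z_L = torch.zeros(x_tokens.shape[0], model.f_L.hidden_size)
    z_H = torch.zeros(x_tokens.shape[0], model.f_H.hidden_size)
    for _ in range(n_segments):
        y_hat, z_L, z_H = forward_one_step_grad(model, x_tokens, z_L, z_H)
        loss = loss_fn(y_hat, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # the crucial detach: no gradients flow across segment boundaries
        z_L, z_H = z_L.detach(), z_H.detach()
```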
The [[Temporal Difference Learning#Q-Learning|Q-learning]] mechanism takes the final state of the H-module and predicts Q values for the "halt" and "continue" actions through a linear layer. Whenever the number of supervision segments exceeds the maximum threshold, or when the halt Q value exceeds the continue one, the loop stops. Not sure if this is strictly RL or just plain supervised learning.
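Roughly (the names `q_head` and `m_max` are mine, and halting the whole batch at once is a simplification of whatever per-example bookkeeping they actually do):

```python
import torch
import torch.nn as nn

d_model = 256                              # hidden size, placeholder value
q_head = nn.Linear(d_model, 2)             # predicts [Q_halt, Q_continue] from the final H-state

def should_halt(z_H: torch.Tensor, segment_idx: int, m_max: int = 8) -> bool:
    q = q_head(z_H)                                      # shape (batch, 2): [Q_halt, Q_continue]
    hit_cap = segment_idx + 1 >= m_max                   # maximum number of segments reached
    prefers_halt = bool((q[:, 0] > q[:, 1]).all())       # whole batch prefers halting (a simplification)
    return hit_cap or prefers_halt
```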
The final loss is a sequence-to-sequence loss with the typical softmax output, combined with a BCE loss for training the Q values:
$L_{\mathrm{ACT}}^m=\operatorname{Loss}\left(\hat{y}^m, y\right)+\operatorname{BinaryCrossEntropy}\left(\hat{Q}^m, \hat{G}^m\right)$.
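In PyTorch terms, something like this (with `g_hat` standing in for the bootstrapped Q targets $\hat{G}^m$):

```python
import torch.nn.functional as F

def act_loss(y_hat, y, q_hat, g_hat):
    """Combined loss sketch: sequence cross-entropy on the predictions plus BCE between
    the predicted Q values (as logits) and their targets."""
    seq_loss = F.cross_entropy(y_hat.view(-1, y_hat.size(-1)), y.view(-1))
    q_loss = F.binary_cross_entropy_with_logits(q_hat, g_hat.float())
    return seq_loss + q_loss
```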
![[hierarchical-reasoning-model 1.png]]