# Tiny Reasoning Model (TRM)

Date: Oct 2025
URL: [\[2510.04871\] Less is More: Recursive Reasoning with Tiny Networks](https://arxiv.org/abs/2510.04871)
Code: [GitHub - SamsungSAILMontreal/TinyRecursiveModels](https://github.com/SamsungSAILMontreal/TinyRecursiveModels)

---

LLMs with CoT and TTC (test-time compute) are not enough to beat every problem.

> While LLMs have made significant progress on ARC-AGI (Chollet, 2019) since 2019, human-level accuracy still has not been reached (6 years later, as of writing of this paper). Furthermore, LLMs struggle on the newer ARC-AGI-2 (e.g., Gemini 2.5 Pro only obtains 4.9% test accuracy with a high amount of TTC).

It wasn't the biological parallels or hierarchical setup of [[Hierarchical Reasoning Model (HRM)]] that made it good, but emulating very deep neural networks with "[[Deep Supervision with Recursion|deep supervision]]" that got the real gains.

> An independent analysis on the ARC-AGI benchmark showed that deep supervision seems to be the primary driver of the performance gains

However, TRM shows how to improve on it by making recursion work together with deep supervision for huge gains, with only 2 layers, taking SOTA on Sudoku-Extreme from 55% to 87%.

> In this work, we show that the benefit from recursive reasoning can be massively improved, making it much more than incremental.

HRM's motivations are a "little" confusing, and all the innovations can be motivated without biology.

> However, this method is quite complicated, relying a bit too heavily on uncertain biological arguments and fixed-point theorems that are not guaranteed to be applicable.

Also, it is interesting to note that the [[Transformers]] used in both TRM and HRM are *encoder* models, not decoder-only models as in LLMs, meaning they maintain global state over the entire context, not just contextual information from preceding tokens. Interesting.

## Targets for Improving HRM

They cite the Implicit Function Theorem (IFT), which states that when a recurrent function converges to a fixed point, backpropagation can be applied in a single step at that equilibrium point. Handy, but there is no guarantee that HRM converges to a fixed point (a toy sketch of this 1-step trick is included further down in this section).

> Thus, while the application of the IFT theorem and 1-step gradient approximation to HRM has some basis since the residuals do generally reduce over time, a fixed point is unlikely to be reached when the theorem is actually applied.

> However, in the case of HRM, it is not iterating to the fixed-point but simply doing forward passes of fL and fH. To make matters worse, HRM is only doing 4 recursions before stopping to apply the one-step approximation.

The adaptive computation time (ACT) mechanism based on Q-learning comes at a cost.

> However, ACT comes at a cost. This cost is not directly shown in the HRM's paper, but it is shown in their official code. The Q-learning objective relies on a halting loss and a continue loss. The continue loss requires an extra forward pass through HRM (with all 6 function evaluations). This means that while ACT optimizes time more efficiently per sample, it requires 2 forward passes per optimization step.

Huge burn on design motivations:

> The HRM's authors justify the two latent variables and two networks operating at different hierarchies based on biological arguments, which are very far from artificial neural networks. They even try to match HRM to actual brain experiments on mice. While interesting, this sort of explanation makes it incredibly hard to parse out why HRM is designed the way it is.
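To make the IFT point above concrete, here is a toy sketch (mine, not the paper's or the repo's) of the 1-step gradient approximation: iterate the recurrent function toward a hoped-for fixed point without building an autograd graph, then backpropagate through one extra step at that point. The `GRUCell` stand-in and all sizes are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

# Stand-in recurrent function; HRM's fL/fH would sit here instead.
f = nn.GRUCell(input_size=64, hidden_size=64)
x = torch.randn(8, 64)   # batch of inputs
z = torch.zeros(8, 64)   # initial latent state

# Recursion toward an (assumed) fixed point, with no autograd graph kept.
with torch.no_grad():
    for _ in range(15):
        z = f(x, z)

# One differentiable step at the approximate fixed point; gradients flow
# through this single evaluation only, which is what the IFT justifies
# *if* z really is near a fixed point.
z = f(x, z)
loss = z.pow(2).mean()   # placeholder loss
loss.backward()
```

The paper's critique is that HRM applies this trick after only a handful of forward passes, where nothing guarantees the state is anywhere near a fixed point.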
Given the lack of an ablation table in their paper and the over-reliance on biological arguments and fixed-point theorems (that are not perfectly applicable), it is hard to determine which parts of HRM are helping what and why. Furthermore, it is not clear why they use two latent features rather than other combinations of features.

## The Improvements

They make 8 improvements!!

Instead of the 1-step gradient approximation, backpropagate through a full recursion process (n evaluations of the low-frequency fL plus one evaluation of fH), while still running earlier recursion processes without gradients.

> instead of using the 1-step gradient approximation, we apply a full recursion process containing n evaluations of fL and 1 evaluation of fH.

> Yet, we can still leverage multiple backpropagation-free recursion processes to improve (zL, zH). With this approach, we obtain a massive boost in generalization on Sudoku-Extreme (improving TRM from 56.5% to 87.4%; see Table 1).

Why not many latent features instead of just 2?

> However, one might wonder why use two latent features instead of 1, 3, or more? And do we really need to justify these so-called "hierarchical" features based on biology to make sense of them? We propose a simple non-biological explanation, which is more natural, and directly answers the question of why there are 2 features.

It turns out zH is the embedding of the solution, and zL is simply the latent reasoning (not a full solution).

> The fact of the matter is: zH is simply the current (embedded) solution. The embedding is reversed by applying the output head and rounding to the nearest token using the argmax operation. On the other hand, zL is a latent feature that does not directly correspond to a solution, but it can be transformed into a solution by applying zH ← fH(x, zL, zH).

> Once this is understood, hierarchy is not needed; there is simply an input x, a proposed solution y (previously called zH), and a latent reasoning feature z (previously called zL). Given the input question x, current solution y, and current latent reasoning z, the model recursively improves its latent z. Then, given the current latent z and the previous solution y, the model proposes a new solution y (or stays at the current solution if it is already good).

We need all three elements for recursive improvement: the input x, the current solution y, and the previous reasoning z.

> we need both y and z separately, and there is no apparent reason why one would need to split z into multiple features

In fact, the more hierarchy you create, the more performance drops. You need exactly one y and one z.

> Thus, we explored using more or less latent variables on Sudoku-Extreme, but found that having only y and z led to better test accuracy in addition to being the simplest and most natural approach.

We also don't need two networks.

> Importantly, since z ← fL(x + y + z) contains x but y ← fH(y + z) does not contain x, the task to achieve (iterating on z versus using z to update y) is directly specified by the inclusion or lack of x in the inputs. Thus, we considered the possibility that both networks could be replaced by a single network doing both tasks.

> In doing so, we obtain better generalization on Sudoku-Extreme (improving TRM from 82.4% to 87.4%; see Table 1) while reducing the number of parameters by half.

It turns out that a single network is enough. More than 2 layers led to overfitting! In fact, 2 layers maximized generalization.

> using tiny networks with deep recursion and deep supervision appears to allow us to bypass a lot of the overfitting.
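Putting the quotes above together, here is a minimal sketch of the core recursion as I read it (my naming and shapes, not the official repo's): a single tiny network does both jobs, and which job it is doing is signalled purely by whether x appears in its input. The MLP stand-in, dimensions, and the values of n and T are illustrative assumptions.

```python
import torch
import torch.nn as nn

d, n, T = 128, 6, 3          # hidden size, z-updates per process, recursion processes (illustrative)

# Stand-in for the tiny 2-layer network; one network plays both roles.
net = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

def recursion_process(x, y, z):
    for _ in range(n):
        z = net(x + y + z)   # refine the latent reasoning z (x included)
    y = net(y + z)           # propose a new solution y (x excluded)
    return y, z

x = torch.randn(8, d)        # embedded question
y = torch.zeros(8, d)        # embedded current solution
z = torch.zeros(8, d)        # latent reasoning feature

# T-1 recursion processes run backprop-free...
with torch.no_grad():
    for _ in range(T - 1):
        y, z = recursion_process(x, y, z)

# ...and only the final process is backpropagated through in full
# (no 1-step approximation needed).
y, z = recursion_process(x, y, z)
```

With an output head applied to y this would yield the proposed answer; the deep-supervision loop sketched at the end of these notes shows how (y, z) get carried across supervision steps.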
Don't even need self-attention if the context size is small, say a 9x9 Sudoku grid.

> Using an MLP instead of self-attention, we obtain better generalization on Sudoku-Extreme (improving from 74.7% to 87.4%; see Table 1). This worked well on Sudoku 9x9 grids, given the small and fixed context length; however, we found this architecture to be suboptimal for tasks with large context length.

No need for Q-learning to decide when to stop: just apply a BCE loss on whether the correct solution has been reached. This removes the continue loss from Q-learning (and its extra forward pass).

EMA on the weights (similar to [[Multi-Network Training with Moving Average Target]]) helps with both preventing collapse AND improving generalization.

Around ~40 recursions seem to be optimal for Sudoku-Extreme. TRM backpropagates through the full recursion process, so this can't be increased much. (A rough sketch pulling the training loop together is at the end of these notes.)

## Other details

Heavy data augmentation is done because the data is tiny.

> While these datasets are small, heavy data augmentation is used in order to improve generalization.

Quite honest about the scope:

> While our approach led to better generalization on 4 benchmarks, every choice made is not guaranteed to be optimal on every dataset

Thoughts on why this works so well:

> Although we simplified and improved on deep recursion, the question of why recursion helps so much compared to using a larger and deeper network remains to be explained; we suspect it has to do with overfitting, but we have no theory to back this explanation.
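To tie the pieces together, here is a rough, self-contained sketch of how the training loop could look under my reading of the notes above: deep supervision with (y, z) detached between steps, a BCE "halt" signal on whether the current answer is already correct (in place of ACT's Q-learning), and an EMA copy of the weights. Class and helper names, the MLP core, the pooling for the halt head, and all hyperparameters are my own illustrative assumptions, not the official implementation.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, D, N, T, N_SUP, EMA_DECAY = 10, 128, 6, 3, 16, 0.999   # illustrative values

class TinyRecursiveNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D)
        self.core = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))
        self.out_head = nn.Linear(D, VOCAB)   # maps the solution y back to tokens
        self.halt_head = nn.Linear(D, 1)      # predicts "is the current answer correct?"

    def recursion_process(self, x, y, z):
        for _ in range(N):
            z = self.core(x + y + z)          # refine latent z (x included)
        y = self.core(y + z)                  # update answer y (x excluded)
        return y, z

    def deep_recursion(self, x, y, z):
        with torch.no_grad():                 # T-1 backprop-free recursion processes
            for _ in range(T - 1):
                y, z = self.recursion_process(x, y, z)
        return self.recursion_process(x, y, z)  # final process carries gradients

model = TinyRecursiveNet()
ema_model = copy.deepcopy(model)              # evaluation copy updated by EMA
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

@torch.no_grad()
def ema_update():
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(EMA_DECAY).add_(p, alpha=1 - EMA_DECAY)

def train_example(x_tokens, target_tokens):
    y = z = None
    for _ in range(N_SUP):                    # deep supervision steps
        x = model.embed(x_tokens)             # re-embed each step (fresh graph)
        if y is None:
            y, z = torch.zeros_like(x), torch.zeros_like(x)
        y, z = model.deep_recursion(x, y, z)
        logits = model.out_head(y)                      # (B, L, VOCAB)
        halt_logit = model.halt_head(y).mean(dim=1)     # (B, 1) pooled halting score
        correct = (logits.argmax(-1) == target_tokens).all(-1).float()   # (B,)
        loss = F.cross_entropy(logits.flatten(0, 1), target_tokens.flatten()) \
             + F.binary_cross_entropy_with_logits(halt_logit.squeeze(-1), correct)
        opt.zero_grad(); loss.backward(); opt.step()
        ema_update()
        y, z = y.detach(), z.detach()         # carry (y, z) forward but cut the graph
        if correct.bool().all():              # halt once every answer in the batch is right
            break

# Toy usage on random "Sudoku-shaped" token grids (81 cells).
xb = torch.randint(0, VOCAB, (8, 81))
yb = torch.randint(0, VOCAB, (8, 81))
train_example(xb, yb)
```

The actual paper tunes the numbers of recursions and supervision steps per task and uses the halt prediction at test time as well; nothing here should be read as the exact official recipe.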