# Deep Supervision with Recursion
Here, Deep Supervision refers mostly to the training paradigm of the recent wave of recursive models such as [[Hierarchical Reasoning Model (HRM)|HRM]] and [[Tiny Reasoning Model (TRM)|TRM]].
The model iteratively refines an answer `y` and a latent reasoning state `z`. One "full recursion process" looks like:
```
for step in range(n):               # e.g., n=6 "think" steps
    z = f(x, y, z)                  # update latent reasoning state
y = g(y, z)                         # "act": update the answer
loss = compute_loss(y, target)      # supervision on the refined answer
```
This full recursion process is repeated T-1 times under `torch.no_grad()`. Those T-1 blocks are pure state evolution: no loss, no gradients, just rolling the `(y, z)` state forward. The loss is then computed only on the final, T-th repetition, which receives full backprop.
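To make the gradient flow concrete, here is a minimal sketch of one supervision step, reusing `f`, `g`, and `compute_loss` from the pseudocode above; the function name, signature, and defaults are illustrative, not TRM's actual API:
```python
import torch

def deep_supervision_step(x, y, z, f, g, target, n=6, T=3):
    """One supervision step (sketch): T-1 no-grad recursions to evolve
    the (y, z) state, then one recursion that receives full backprop.
    f, g, compute_loss are the update/loss functions from the pseudocode
    above; n and T are illustrative hyperparameter names."""
    with torch.no_grad():
        for _ in range(T - 1):          # state evolution only: no loss, no grad
            for _ in range(n):
                z = f(x, y, z)          # "think": refine latent state
            y = g(y, z)                 # "act": refine answer
    for _ in range(n):                  # final recursion: gradients flow here
        z = f(x, y, z)
    y = g(y, z)
    loss = compute_loss(y, target)      # loss only on the last recursion
    loss.backward()
    return y.detach(), z.detach()       # detached state seeds the next step
```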
Full pseudocode from TRM:
![[deep-supervision-with-recursion.png]]
Whether this is novel is highly debatable. However, the specific setup of T-1 no-grad steps to refine the state, followed by a single final step with full backprop, does seem quite interesting *and* powerful.
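For completeness, a sketch of how supervision steps chain over a single batch, assuming the hypothetical `deep_supervision_step` helper above; `N_sup`, `optimizer`, `y_init`, and `z_init` are placeholder names, not TRM's:
```python
# Outer deep-supervision loop (sketch): each supervision step gets its
# own loss and backward pass, while (y, z) are carried across steps but
# detached, so gradients never cross supervision-step boundaries.
y, z = y_init, z_init                 # assumed initial answer / latent state
for _ in range(N_sup):                # N_sup supervision steps per batch
    optimizer.zero_grad()
    y, z = deep_supervision_step(x, y, z, f, g, target)  # backward() inside
    optimizer.step()
```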