# Multi-Network Training with Moving Average Target
When neural networks must learn from each other, they can create unstable feedback loops in which each network chases the moving targets produced by the other.
Gradient descent assumes the training targets are fixed, but in many settings two networks must train on each other's outputs simultaneously. This creates a chicken-and-egg problem:
- Network A needs stable targets from Network B to train properly
- Network B needs stable targets from Network A to train properly
- But both are updating constantly, making their outputs unstable targets
**Why This Causes Instability**: Each network update changes the target distribution for the other network, leading to oscillations, divergence, or collapse to trivial solutions.
**The Moving Average Solution**: Use exponential moving averages (EMA) to create slowly-evolving "target" versions of networks:
```
target_params = τ * main_params + (1-τ) * target_params
```
Here τ ≈ 0.001-0.01, so the target parameters drift only very slowly toward the main network's current weights.
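A minimal sketch of this update in PyTorch (the helper name `ema_update` and the value `tau=0.005` are illustrative assumptions, not taken from any particular library):

```
import copy

import torch
import torch.nn as nn

def ema_update(target_net: nn.Module, main_net: nn.Module, tau: float = 0.005) -> None:
    """Polyak/EMA update: target <- tau * main + (1 - tau) * target."""
    with torch.no_grad():
        for t, m in zip(target_net.parameters(), main_net.parameters()):
            t.mul_(1.0 - tau).add_(m, alpha=tau)

# The target network starts as a frozen copy of the main network.
main_net = nn.Linear(4, 2)               # stand-in for any model
target_net = copy.deepcopy(main_net)
for p in target_net.parameters():
    p.requires_grad_(False)              # the target never receives gradients

ema_update(target_net, main_net, tau=0.005)
```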
## Examples
### Target Networks (RL - DQN/DDPG/SAC)
- **Problem**: The Q-learning update Q(s,a) ← r + γ·max_a' Q(s',a') uses the same network on both sides of the equation
- **Instability**: As Q-network updates, the targets Q(s',a') keep changing, preventing convergence
- **Solution**: A separate target Q-network, updated via EMA (or periodic hard copies in the original DQN), supplies targets that stay nearly fixed over hundreds of steps (sketched below)
- **Result**: Main network can learn against consistent targets, achieving stable Q-learning
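A hedged sketch of how the target network enters a DQN-style temporal-difference update (the network sizes, `gamma`, and `tau` are illustrative choices; the replay buffer, exploration policy, and the rest of the training loop are omitted):

```
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # Q(s, ·)
target_q_net = copy.deepcopy(q_net)
for p in target_q_net.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma, tau = 0.99, 0.005

def td_step(states, actions, rewards, next_states, dones):
    # Targets come from the slowly moving target network, with no gradient flow.
    with torch.no_grad():
        next_q = target_q_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # EMA update keeps the targets nearly stationary between steps.
    with torch.no_grad():
        for t, m in zip(target_q_net.parameters(), q_net.parameters()):
            t.mul_(1.0 - tau).add_(m, alpha=tau)
    return loss.item()
```

Because the targets are computed under `torch.no_grad()` from the EMA copy, the regression target is effectively fixed over short horizons, which is what restores stable learning.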
### Momentum Encoders (SSL - MoCo/BYOL)
- **Problem**: Learning representations by maximizing agreement between different augmented views
- **Instability**: If both encoder networks update together, they can collapse to output identical constant vectors (trivial solution)
- **Solution**: Asymmetric setup - the "student" (online) encoder trains by gradient descent, while the "teacher" (momentum) encoder is updated only via an EMA of the student's weights (see the sketch after this list)
- **Result**: Teacher provides stable, diverse targets while slowly incorporating student's improvements
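A sketch of the asymmetric student/teacher setup in the style of BYOL/MoCo (the encoder and predictor architectures, the cosine-similarity loss, and `tau = 0.004` are simplified stand-ins for the real pipelines; SSL papers usually quote the complementary momentum m = 1 - τ ≈ 0.996):

```
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))   # trained by gradients
predictor = nn.Linear(32, 32)            # BYOL-style predictor head on the student branch only
teacher = copy.deepcopy(student)         # EMA copy; never receives gradients
for p in teacher.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.SGD(list(student.parameters()) + list(predictor.parameters()), lr=0.1)
tau = 0.004                              # small tau => the teacher drifts slowly

def ssl_step(view_a, view_b):
    # The student predicts the teacher's representation of a different augmented view.
    online = F.normalize(predictor(student(view_a)), dim=-1)
    with torch.no_grad():
        target = F.normalize(teacher(view_b), dim=-1)
    loss = (2.0 - 2.0 * (online * target).sum(dim=-1)).mean()   # BYOL-style cosine loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # The teacher tracks the student via the same EMA rule as above.
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(1.0 - tau).add_(s, alpha=tau)
    return loss.item()

# Example call with random "augmented views" standing in for real image batches.
loss = ssl_step(torch.randn(8, 128), torch.randn(8, 128))
```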
## Why Moving Averages Work So Well
1. **Temporal Decoupling**: Breaks the circular dependency by introducing a time delay
2. **Stability**: Targets change slowly enough for networks to actually learn from them
3. **Information Preservation**: Still incorporates improvements, just gradually
4. **Regularization Effect**: The averaged network is often more robust than the instantaneous one
5. **Simple Implementation**: Just one line of code to dramatically improve training
**The Elegance**: You get the benefits of networks co-evolving and learning from each other, without the chaos of circular dependencies. It's a simple solution to a fundamental problem in multi-network training.