- An action $a_t$ corresponds to appending a legal assembly instruction (for example, `mov<A,B>`) to the current algorithm generated thus far.
- A reward $r_t$ is received that comprises both a measure of algorithm correctness and latency. Algorithm correctness (Fig. 2b) involves inputting a set of $N$ test sequences into the current algorithm $\mathbf{P}_t$ to generate $N$ outputs. These outputs are then compared to the expected outputs and a correctness reward $r_t$ is computed. Latency rewards can be generated by either (1) penalizing the agent for increasing the length of the algorithm (when length and latency are highly correlated), which we refer to as the algorithm length reward, or (2) measuring the actual latency of the algorithm. (A minimal sketch of this reward appears after this list.)
- The game is executed for a limited number of steps, after which it is terminated. Winning corresponds to generating a correct, low-latency algorithm using assembly instructions; losing corresponds to generating an incorrect algorithm, or a correct but inefficient one.
- Learning procedure:
    - The agent's primary learning algorithm is an extension of the [[AlphaZero]] agent and guides a [[Monte-Carlo Tree Search]] planning procedure using a deep neural network. The input to the neural network is the state $\mathbf{S}_t$ and the output is a policy and a value prediction. The policy prediction is a distribution over actions, and the value function is a prediction of the cumulative return $R$ that the agent should expect to receive from the current state $\mathbf{S}_t$.
    - During a game, the agent receives the current state $\mathbf{S}_t$ as input, executes an MCTS procedure and uses it to select the next action. The generated games are then used to update the network's parameters, enabling the agent to learn.
    - AlphaDev has two value function heads: one predicting algorithm correctness and the other predicting algorithm latency. The latency head directly predicts the latency of a given program, using the program's actual computed latency as a Monte Carlo target during training. This dual-head approach achieved substantially better results than the vanilla single-head value function setup when optimizing for real latency. (A sketch of these two targets also appears after this list.)
- Representation:
    - It is critical that AlphaDev has a representation capable of capturing complex algorithmic structures in order to efficiently explore the space of instructions.
    - To achieve this, we introduce the AlphaDev representation network, which is comprised of two networks (see the combination sketch after this list):
        - A transformer encoder network that provides the agent with a representation of the algorithm structure.
        - A CPU state encoder network that helps the agent predict how the algorithm affects the dynamics of memory and registers. It comprises a multilayer perceptron that receives as input the state of each register and memory location for a given set of inputs.
    - These networks each output embeddings that are combined to yield the AlphaDev state representation.
- RL vs stochastic search optimization (AlphaDev-S):
    - In general, we find that AlphaDev consistently outperforms AlphaDev-S when learning from scratch without previous knowledge.
    - In addition, as the size of the program increases, AlphaDev explores orders of magnitude fewer programs (12 million in the worst case) than AlphaDev-S (31 trillion in the worst case). This may be because AlphaDev better explores the space of algorithms than the breadth-first stochastic search procedure, which gets stuck more easily in local optima.
    - AlphaDev also never evaluates latency during search, relying instead on the latency value function predictions; because of this, it only needs to compute actual measured latency for less than $0.002\%$ of generated programs.
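The per-step reward described above can be made concrete with a short sketch. This is only an illustration, not AlphaDev's implementation: `run_program` is a hypothetical stand-in for an emulator of the partial assembly program, rewarding the change in the number of correct outputs is an assumed shaping, and the length-penalty weight is arbitrary.

```python
import numpy as np

def correctness_reward(run_program, test_inputs, expected_outputs, prev_num_correct):
    """Correctness part of r_t: run the current program P_t on the N test
    sequences and compare each output to the expected output."""
    num_correct = sum(
        np.array_equal(run_program(x), y)
        for x, y in zip(test_inputs, expected_outputs)
    )
    # Reward the improvement in correctness since the previous step (assumed shaping).
    return num_correct - prev_num_correct, num_correct

def length_reward(num_instructions, weight=0.01):
    """Algorithm length reward: a small penalty per appended instruction,
    used when program length and latency are highly correlated."""
    return -weight * num_instructions
```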
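Similarly, the dual value-head training targets can be sketched as a simple regression loss. The squared-error form and equal weighting are assumptions for illustration; only the choice of targets (Monte Carlo return for the correctness head, actual computed latency for the latency head) comes from the notes above.

```python
import numpy as np

def dual_head_value_loss(pred_correctness, pred_latency, mc_returns, measured_latencies):
    """The correctness head regresses the Monte Carlo return of the game,
    while the latency head regresses the program's actual computed latency
    (its Monte Carlo target)."""
    correctness_loss = np.mean((np.asarray(pred_correctness) - np.asarray(mc_returns)) ** 2)
    latency_loss = np.mean((np.asarray(pred_latency) - np.asarray(measured_latencies)) ** 2)
    return correctness_loss + latency_loss
```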
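Finally, a rough sketch of how the two representation networks could be combined into the AlphaDev state representation. The mean-pooling, single-layer MLP, and concatenation below are illustrative assumptions; the notes only say that a transformer encoder embeds the algorithm, an MLP encodes the CPU state, and the resulting embeddings are combined.

```python
import numpy as np

def alphadev_state_embedding(instruction_embeddings, cpu_states, mlp_w, mlp_b):
    """instruction_embeddings: (num_instructions, d) output of the transformer
    encoder over the program; cpu_states: (num_inputs, num_locations) register
    and memory contents for a set of example inputs."""
    algorithm_embedding = instruction_embeddings.mean(axis=0)         # pool over instructions (assumed)
    cpu_embedding = np.tanh(cpu_states.reshape(-1) @ mlp_w + mlp_b)   # CPU state encoder MLP (one layer, assumed)
    return np.concatenate([algorithm_embedding, cpu_embedding])       # combined state representation
```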
## The details

![[alphadev_architecture.png]]

- The learning paradigm is exactly the same setup as [[MuZero]].
- Correctness cost (used in AlphaDev-S):
    - For the correctness cost function, we implemented three types of cost function. The first is defined as the percentage of incorrectly placed items, $\frac{P - PC_t}{P}$, where $P$ is the total number of items to place and $PC_t$ is the number of correctly placed items at timestep $t$. The second variant is the square root of this expression.
    - The final cost function takes the square root of the difference, $\sqrt{P - PC_t}$, and this is what yielded the best performance. (A small worked example appears at the end of these notes.)
- Code availability:
    - Pseudocode at https://github.com/deepmind/alphadev that includes the environment, the full actor and training loops, as well as the core MCTS algorithm.
    - In addition, we include our actual JAX implementation of the policy, value and representation networks, which enables the architectures to be reproduced.
    - Finally, we have a config file containing the hyper-parameter definitions to be used with the agent.

## The results

- New SOTA sorting algorithms discovered for both fixed-size and variable-size sequences.
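To make the AlphaDev-S correctness cost concrete, here is a minimal worked example of the three variants described in the details section, with $P$ and $PC_t$ as defined there (an illustration only, not the repository's implementation):

```python
import math

def correctness_costs(P, PC_t):
    """The three AlphaDev-S correctness cost variants: the fraction of
    incorrectly placed items, its square root, and the square root of the
    raw difference P - PC_t (the best-performing variant)."""
    fraction_incorrect = (P - PC_t) / P
    return {
        "fraction_incorrect": fraction_incorrect,
        "sqrt_fraction_incorrect": math.sqrt(fraction_incorrect),
        "sqrt_difference": math.sqrt(P - PC_t),
    }

# Example: 16 items to place, 12 currently placed correctly.
print(correctness_costs(P=16, PC_t=12))
# {'fraction_incorrect': 0.25, 'sqrt_fraction_incorrect': 0.5, 'sqrt_difference': 2.0}
```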