# Reinforcement Learning Problem Setup

## Introduction

Reinforcement learning provides the capacity not only to teach an artificial agent how to act, but to allow it to learn through its own interactions with an environment. By combining the complex representations that deep neural networks can learn with the goal-driven learning of an RL agent, computers have accomplished some amazing feats, like beating humans at over a dozen Atari games and defeating the Go world champion.

> Artificial Intelligence = Reinforcement Learning + Deep Learning - David Silver (2016)

Reinforcement learning is the class of algorithms that enables computers to experiment continually in a simulated environment and teach themselves how to act in the most optimal way possible. It is based on the simple idea that an "agent", when reinforced with a reward for performing a good action, will try to take actions that maximise the future reward.

How is it different from the supervised learning setup?

1. In supervised learning, the goal is to learn a mapping from inputs to output labels; in reinforcement learning it is to learn a 'policy' that defines how the agent will act in the given environment.
2. RL algorithms are capable of multistep decision making, i.e. performing a series of actions, whereas supervised algorithms handle only single-step prediction problems.
3. RL algorithms do not need huge amounts of human-labelled data like supervised learning algorithms; they are guided by the reward signal provided by the environment.

Here we describe the standard problem setup of reinforcement learning.

## Environment

### State

State $s$ is a representation of a location in the environment. It contains the data that is used by the RL algorithm.

### Action

Action $a$ is a possible response that the agent can perform while it is in a given state. The agent's actions affect the subsequent data it receives.

### Reward

Reward $r$ is a scalar feedback signal provided by the environment after a particular action is taken in a given state. The total reward that the agent receives is delayed, not instantaneous. The agent can prioritise immediate or future reward. The return $R_t = \sum_{k=0}^\infty \gamma^k r_{t+k}$ is the discounted, accumulated reward with discount factor $\gamma \in (0, 1]$. A discount factor of $\gamma = 0$ makes the agent very short-sighted, whereas $\gamma = 1$ makes the agent care about future possible rewards as much as immediate ones. The agent aims to maximise the expectation of this long-term return from each state.

## Agent

An RL agent interacts with the environment over time. Its job is to maximise the total future reward as it performs different actions and lands in different states. It may include one or more of these components:

### Policy

The policy is the agent's behaviour function, represented as $\pi(a \mid s)$. It is essentially a map from state to action. The goal of RL is to learn the policy which maximises the reward (the optimal policy). Policies can be either stochastic, $\pi(a \mid s)$, or deterministic, $\pi(s)$. A deterministic policy can be derived from a stochastic one in a "greedy" manner by taking the argmax over actions.

### Value Function

The value function is a prediction of expected future reward. It is used to evaluate how good or bad a state or state-action pair is.

## Model

The model is the agent's representation of how it thinks the environment works. It consists of:

### State Transition Function

The state transition function $P(s_{t+1} \mid s_t, a_t)$ models the dynamics of the environment. It predicts the next state, given the current state and the action taken.
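As a concrete illustration, here is a minimal sketch of a tabular transition model in plain Python. The two states `s0`/`s1`, the actions `left`/`right`, the dictionary `P`, and the helper `sample_next_state` are all made up for illustration; the dictionary simply plays the role of $P(s_{t+1} \mid s_t, a_t)$, mapping a state-action pair to a distribution over next states.

```python
import random

# Hypothetical tabular transition model P(s' | s, a) for a toy two-state
# environment. Each (state, action) pair maps to {next_state: probability}.
P = {
    ("s0", "left"):  {"s0": 1.0},
    ("s0", "right"): {"s0": 0.2, "s1": 0.8},
    ("s1", "left"):  {"s0": 0.9, "s1": 0.1},
    ("s1", "right"): {"s1": 1.0},
}

def sample_next_state(state, action):
    """Sample s_{t+1} from the model's transition distribution P(. | s_t, a_t)."""
    next_states, probs = zip(*P[(state, action)].items())
    return random.choices(next_states, weights=probs, k=1)[0]

print(sample_next_state("s0", "right"))  # usually prints "s1"
```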
### Reward Function

The reward function $R(s, a)$ predicts the next immediate reward given the current state and the action taken.

RL algorithms are categorized as [[Model Free Reinforcement Learning]] and [[Model Based Reinforcement Learning]]. Model-based algorithms use predefined or approximated state transition and reward functions to figure out their policy. Model-free algorithms do not figure out how the environment works, but use their experience to maintain a value function.

To sum it up: at each timestep $t$ the agent is in a state $s_t$, it performs action $a_t$ by following a policy $\pi(a_t \mid s_t)$, then receives a scalar reward $r_t$ according to the reward function $R(s, a)$ and transitions to the next state $s_{t+1}$ according to the state transition probability $P(s_{t+1} \mid s_t, a_t)$.

## Markov Decision Processes

[[Markov Decision Processes]]

## Reinforcement Learning Objective

$ \underset{\pi}{\operatorname{maximize}} \underset{a_{t} \sim \pi\left(\cdot \mid s_{t}\right) \atop s_{t+1} \sim P\left(\cdot \mid s_{t}, a_{t}\right)}{\mathbb{E}}\left[\sum_{t=0}^{\infty} \gamma^{t} r\left(s_{t}, a_{t}\right)\right] $

## The Value Function

$V^{\pi}(s)$ is the state-value function of the MDP (Markov Decision Process). It is the expected return starting from state $s$ and following policy $\pi$.

$ V^{\pi}(s)=\mathbb{E}_{\pi}\left[G_{t} \mid s_{t}=s\right] $

$G_{t}$ is the total DISCOUNTED reward from time step $t$, as opposed to $r_t$, which is the immediate reward. Here you are taking the expectation over ALL actions according to the policy $\pi$.

## The Q Function

$Q^{\pi}(s, a)$ is the action-value function. It is the expected return starting from state $s$, taking action $a$, and thereafter following policy $\pi$. It focuses on the particular action at the particular state.

$ Q^{\pi}(s, a)=\mathbb{E}_{\pi}\left[G_{t} \mid s_{t}=s, a_{t}=a\right] $

The relationship between $Q^{\pi}$ and $V^{\pi}$ (the value of being in that state) is

$ V^{\pi}(s)=\sum_{a \in A} \pi(a \mid s) \, Q^{\pi}(s, a) $

You sum every action-value multiplied by the probability of taking that action (the policy $\pi(a \mid s)$).

## When to use V vs Q

With a model, state values alone are sufficient to determine a policy; one simply looks ahead one step and chooses whichever action leads to the best combination of reward and next state, as in [[Dynamic Programming (RL)]]. Without a model, however, state values alone are not sufficient. One must explicitly estimate the value of each action in order for the values to be useful in suggesting a policy.

Q-values are a great way to make actions explicit, so you can deal with problems where the transition function is not available (model-free). However, when your action space is large, things are not so nice and Q-values are not so convenient. They are harder to compute in spaces with a huge number of actions or even continuous action spaces. From a sampling perspective, the dimensionality of $Q(s, a)$ is higher than that of $V(s)$, so it might be harder to get enough $(s, a)$ samples than $(s)$ samples. If you have access to the transition function, $V$ alone is sometimes enough. There are also approaches that combine both, for example [[PGT Actor-Critic]] architectures.
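To make the relationship $V^{\pi}(s) = \sum_{a} \pi(a \mid s) Q^{\pi}(s, a)$ and the greedy-policy idea concrete, here is a minimal NumPy sketch. The sizes `n_states`/`n_actions` and the random `Q` table and `policy` matrix are purely illustrative assumptions, not the output of any particular algorithm.

```python
import numpy as np

n_states, n_actions = 4, 3  # sizes of a hypothetical small MDP

# Illustrative Q^pi(s, a) table and stochastic policy pi(a | s);
# in practice these would come from an RL algorithm, not random numbers.
rng = np.random.default_rng(0)
Q = rng.normal(size=(n_states, n_actions))        # Q[s, a] = Q^pi(s, a)
policy = rng.random(size=(n_states, n_actions))
policy /= policy.sum(axis=1, keepdims=True)       # rows sum to 1: pi(a | s)

# V^pi(s) = sum_a pi(a | s) * Q^pi(s, a): expected action-value under the policy.
V = (policy * Q).sum(axis=1)

# Deterministic "greedy" policy extracted from Q: pi(s) = argmax_a Q^pi(s, a).
greedy_actions = Q.argmax(axis=1)

print("V(s) per state:", V)
print("greedy action per state:", greedy_actions)
```

Note that the greedy extraction needs only $Q$; extracting a policy from $V$ alone would additionally require the transition model to look one step ahead, which is exactly the distinction drawn in the section above.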