An MDP is a tuple $\langle S, A, P, R, \gamma \rangle$ where:
- $S$ is a finite set of states
- $A$ is a finite set of actions
- $P$ is a state transition probability function/matrix, $P[S_{t+1}=s' \mid S_t=s, A_t=a]$
- $R$ is the expected reward function, $R = E[R_{t+1} \mid S_t=s, A_t=a]$
- $\gamma$ is a discount factor, $\gamma \in (0, 1]$

Once we have the MDP, a policy can be learned via [[Dynamic Programming (RL)#Policy Iteration]] or [[Dynamic Programming (RL)#Value Iteration]].

Examples of applications of MDPs:
1. Robot navigation
2. Inventory management
3. Portfolio optimization
4. Purchase and production optimization

## Markov Property

In an MDP, all states have the Markov property: given the present state, the future is independent of the past. Each state contains all the useful information from the agent's history.

## Variants of MDP

[[Semi-Markov Decision Processes]]
[[Markov Reward Processes]]

---

## References

1. Chapter 3, *Reinforcement Learning: An Introduction*, Sutton and Barto, 2nd Edition
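
Below is a minimal sketch (not from the referenced text) of how the tuple $\langle S, A, P, R, \gamma \rangle$ can be represented in code and solved with value iteration. The states, actions, rewards, and transition probabilities are invented purely for illustration.

```python
# Hypothetical two-state, two-action MDP <S, A, P, R, gamma>, solved with value iteration.
S = ["s0", "s1"]        # finite set of states
A = ["stay", "move"]    # finite set of actions
gamma = 0.9             # discount factor in (0, 1]

# P[s][a][s'] = P[S_{t+1} = s' | S_t = s, A_t = a]
P = {
    "s0": {"stay": {"s0": 1.0, "s1": 0.0}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.0, "s1": 1.0}, "move": {"s0": 0.7, "s1": 0.3}},
}

# R[s][a] = E[R_{t+1} | S_t = s, A_t = a]
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 2.0, "move": 0.0},
}

def value_iteration(theta=1e-8):
    """Apply the Bellman optimality backup until the value function converges."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            q = [R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in S) for a in A]
            new_v = max(q)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            break
    # Greedy policy with respect to the converged value function
    pi = {
        s: max(A, key=lambda a: R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in S))
        for s in S
    }
    return V, pi

V, pi = value_iteration()
print(V)   # optimal state values
print(pi)  # greedy (optimal) policy
```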