An MDP is a tuple $\langle S, A, P, R, \gamma \rangle$ where:
- $S$ is a finite set of states
- $A$ is a finite set of actions
- $P$ is a state transition probability function/matrix, $P[S_{t+1}=s' \mid S_t=s, A_t=a]$
- $R$ is the expected reward function, $R = E[R_{t+1} \mid S_t=s, A_t=a]$
- $\gamma$ is a discount factor, $\gamma \in (0, 1]$

Once we have the MDP, a policy can be learned via [[Dynamic Programming (RL)#Policy Iteration]] or [[Dynamic Programming (RL)#Value Iteration]].

Examples of applications of MDPs:
1. Robot navigation
2. Inventory management
3. Portfolio optimization
4. Purchase and production optimization

## Markov Property

In an MDP, all states have the Markov property: given the present state, the future is independent of the past. Each state contains all the useful information from the agent's history.

## Variants of MDP

[[Semi-Markov Decision Processes]]
[[Markov Reward Processes]]

---

## References

1. Chapter 3, *Reinforcement Learning: An Introduction*, Sutton and Barto, 2nd Edition
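
Below is a minimal sketch (not from the referenced text) of how the tuple $\langle S, A, P, R, \gamma \rangle$ can be represented in code and solved with value iteration. The states, actions, rewards, and transition probabilities are invented purely for illustration.

```python
# Hypothetical two-state, two-action MDP <S, A, P, R, gamma>, solved with value iteration.
S = ["s0", "s1"]        # finite set of states
A = ["stay", "move"]    # finite set of actions
gamma = 0.9             # discount factor in (0, 1]

# P[s][a][s'] = P[S_{t+1} = s' | S_t = s, A_t = a]
P = {
    "s0": {"stay": {"s0": 1.0, "s1": 0.0}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.0, "s1": 1.0}, "move": {"s0": 0.7, "s1": 0.3}},
}

# R[s][a] = E[R_{t+1} | S_t = s, A_t = a]
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 2.0, "move": 0.0},
}

def value_iteration(theta=1e-8):
    """Apply the Bellman optimality backup until the value function converges."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            q = [R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in S) for a in A]
            new_v = max(q)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            break
    # Greedy policy with respect to the converged value function
    pi = {
        s: max(A, key=lambda a: R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in S))
        for s in S
    }
    return V, pi

V, pi = value_iteration()
print(V)   # optimal state values
print(pi)  # greedy (optimal) policy
```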