An MDP Take-away

Last updated on August 13, 2024

A take-away on Markov decision processes (MDPs).

Problem Formulation

  • states: the set of all possible states
  • start_state: the starting state
  • actions(state): given a state, return the list of all valid actions
  • T(state, action, new_state): return the probability of reaching new_state when taking action in state
  • reward(state, action, new_state): return the reward for the given transition
  • is_end(state): check whether state is a terminal state
  • gamma / discount: the discount factor for future rewards, between 0 and 1
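
As a concrete illustration, here is a minimal Python sketch of this interface. SimpleMDP, its two states "in" / "end", and all transition probabilities and rewards are made up for this post, not taken from any library:

class SimpleMDP:
    """A hypothetical two-state MDP ("in" / "end"), for illustration only."""

    def __init__(self, discount=0.9):
        self.start_state = "in"
        self.discount = discount

    def states(self):
        return ["in", "end"]

    def actions(self, state):
        return ["stay", "quit"]

    def T(self, state, action, new_state):
        # "stay": remain "in" with probability 0.8, otherwise move to "end".
        if action == "stay":
            return {"in": 0.8, "end": 0.2}.get(new_state, 0.0)
        # "quit": end the episode deterministically.
        return 1.0 if new_state == "end" else 0.0

    def reward(self, state, action, new_state):
        return 4.0 if action == "stay" else 10.0

    def is_end(self, state):
        return state == "end"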

Value Iteration

Value iteration computes the optimal value of every state by repeatedly applying the Bellman update below; with a discount factor strictly less than 1 it converges geometrically, so in practice it is quite fast. A Python sketch follows the list.

  • initialize the value v(s) of every state s to 0
  • for iteration t in 1.. (until convergence or the maximum iteration depth)
    • for each state s
      • for each action a in actions(s)
        • compute the Q value Q(s, a)
      • take the maximum of these Q values as max_q
      • update v(s) to max_q
  • return the values v
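
A minimal sketch of this loop, assuming an mdp object with the interface sketched above (states(), actions(), T(), reward(), is_end(), discount):

def value_iteration(mdp, max_iters=100, tol=1e-6):
    v = {s: 0.0 for s in mdp.states()}  # initialize v(s) = 0
    for _ in range(max_iters):
        new_v = {}
        for s in mdp.states():
            if mdp.is_end(s):
                new_v[s] = 0.0  # terminal states have value 0
                continue
            # Bellman update: v(s) = max over a of sum over s' of T * (r + discount * v(s'))
            new_v[s] = max(
                sum(mdp.T(s, a, sp) * (mdp.reward(s, a, sp) + mdp.discount * v[sp])
                    for sp in mdp.states())
                for a in mdp.actions(s)
            )
        if max(abs(new_v[s] - v[s]) for s in mdp.states()) < tol:
            return new_v  # values stopped changing
        v = new_v
    return v

With the made-up numbers in SimpleMDP above, value_iteration(SimpleMDP())["in"] converges to about 14.3 (the fixed point of v = 4 + 0.72 * v), so "stay" beats "quit" (which is worth 10).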

Q Value

Q(s: state, a: action) -> float

Given a state s and an action a:

  • for every possible new state new_state
    • take the transition probability T(state, action, new_state) as p
    • take the reward reward(state, action, new_state) as r
    • compute item = p * (r + discount * v), where v is the value of new_state
  • return the sum of all items
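
The same computation as a standalone helper, assuming the mdp interface sketched earlier and a dict v mapping each state to its current value estimate:

# Q(s, a) = sum over new_state of T(s, a, new_state) * (r + discount * v(new_state))
def q_value(mdp, v, state, action):
    return sum(
        mdp.T(state, action, new_state)
        * (mdp.reward(state, action, new_state) + mdp.discount * v[new_state])
        for new_state in mdp.states()
    )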

Policy Evaluation

A policy is a function that maps a state to an action, i.e., it specifies which action to take in each state.

Policy evaluation computes the value of every state under a fixed policy pi; these values tell us how good pi is and let us compare it against other policies.

  • initialize the value v(s) of every state s to 0
  • for iteration t in 1.. (until convergence or the maximum iteration depth)
    • for each state s
      • take the action a = pi(s) given by the evaluated policy pi
      • update v(s) to Q(s, a)

This is value iteration with the maximization over all actions replaced by the single action prescribed by the evaluated policy pi.
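
A minimal sketch, reusing the q_value helper above; the policy is assumed to be a dict mapping each non-terminal state to an action:

# Iterative policy evaluation: repeatedly set v(s) = Q(s, pi(s)) until values stop changing.
def policy_evaluation(mdp, policy, max_iters=100, tol=1e-6):
    v = {s: 0.0 for s in mdp.states()}
    for _ in range(max_iters):
        new_v = {
            s: 0.0 if mdp.is_end(s) else q_value(mdp, v, s, policy[s])
            for s in mdp.states()
        }
        if max(abs(new_v[s] - v[s]) for s in mdp.states()) < tol:
            return new_v
        v = new_v
    return v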

Policy Iteration

Policy iteration alternates between evaluating the current policy and improving it, until the policy stops changing. At that point every action the policy prescribes achieves the highest Q value, so the policy itself is optimal.

  • randomly initialize the policy pi
  • for iteration t in 1.. (until convergence, i.e., pi no longer changes, or the maximum iteration depth)
    • compute v of all states by policy evaluation under pi
    • improve the policy
      • for each state s
        • for each action a in actions(s)
          • compute q = Q(s, a)
        • take the maximum of these Q values as max_q
        • update pi(s) to the action a that achieves max_q

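Putting the pieces together, using the policy_evaluation and q_value sketches above; initializing with the first valid action stands in for the random initialization:

# Policy iteration: evaluate the current policy, then greedily improve it.
def policy_iteration(mdp, max_iters=100):
    # Arbitrary initial policy: the first valid action in every non-terminal state.
    policy = {s: mdp.actions(s)[0] for s in mdp.states() if not mdp.is_end(s)}
    for _ in range(max_iters):
        v = policy_evaluation(mdp, policy)
        # Greedy improvement: in each state, switch to the action with the highest Q value.
        new_policy = {
            s: max(mdp.actions(s), key=lambda a: q_value(mdp, v, s, a))
            for s in policy
        }
        if new_policy == policy:  # the policy no longer changes: it is optimal
            return policy, v
        policy = new_policy
    return policy, v

For the hypothetical SimpleMDP this returns {"in": "stay"} together with the same values that value iteration finds.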