An MDP Take-away
Last updated on August 13, 2024
A Markov decision process take-away.
Problem Formulation
- `states`: a set of all possible states
- `start_state`: the starting state
- `actions(state)`: given a state, return a list of all valid actions
- `T(state, action, new_state)`: return the probability of reaching `new_state` after taking `action` at `state`
- `reward(state, action, new_state)`: return the reward for the given transition
- `is_end(state)`: check whether the state is a terminal state
- `gamma`/`discount`: discount factor for future rewards, between 0 and 1 inclusive (a toy Python example follows the list)
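
As a concrete illustration, here is a minimal Python sketch of this interface for a hypothetical two-state "stay or quit" game; the state names, rewards, and transition probabilities are made up for the example.

```python
# A toy MDP implementing the interface above: from state "in" you can
# "stay" (reward 4, 2/3 chance of remaining "in") or "quit" (reward 10,
# the game always ends). All numbers here are illustrative.
states = ["in", "end"]
start_state = "in"
discount = 1.0  # gamma

def actions(state):
    # all valid actions for the given state
    return ["stay", "quit"] if state == "in" else []

def T(state, action, new_state):
    # probability of reaching new_state after taking action at state
    if state == "in" and action == "stay":
        return {"in": 2 / 3, "end": 1 / 3}.get(new_state, 0.0)
    if state == "in" and action == "quit":
        return 1.0 if new_state == "end" else 0.0
    return 0.0

def reward(state, action, new_state):
    # reward for the transition (state, action) -> new_state
    if state == "in" and action == "stay":
        return 4
    if state == "in" and action == "quit":
        return 10
    return 0

def is_end(state):
    # terminal-state check
    return state == "end"
```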
Value Iteration
Value iteration converges quite fast. A minimal Python sketch follows the steps below.
- initialize `v` of all states `s` to 0
- for iteration `t` in `1..` (until convergence or max iteration depth)
  - for each state `s`
    - for `a` in `actions(s)`
      - get the `q` value `Q(s, a)`
    - take the maximum of all `q` values as `max_q`
    - update `v` of `s` with `max_q`
- return all `v`
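
A minimal sketch of these steps in Python, assuming the toy `states`, `actions`, `T`, `reward`, `is_end`, and `discount` defined earlier; the `max_iters` and `tol` parameters are just illustrative stopping criteria.

```python
def value_iteration(states, actions, T, reward, is_end, discount,
                    max_iters=100, tol=1e-6):
    # v[s] starts at 0 for every state
    v = {s: 0.0 for s in states}
    for _ in range(max_iters):
        new_v = {}
        for s in states:
            if is_end(s) or not actions(s):
                new_v[s] = 0.0  # terminal states keep value 0
                continue
            # take the maximum Q value over all valid actions, where
            # Q(s, a) = sum over new states of T * (reward + discount * v)
            new_v[s] = max(
                sum(T(s, a, sp) * (reward(s, a, sp) + discount * v[sp])
                    for sp in states)
                for a in actions(s)
            )
        # stop once the values stop changing (convergence)
        if max(abs(new_v[s] - v[s]) for s in states) < tol:
            return new_v
        v = new_v
    return v
```

On the toy MDP, `value_iteration(states, actions, T, reward, is_end, discount)` should converge to roughly `{'in': 12.0, 'end': 0.0}`: staying forever is worth `4 + (2/3) * 12 = 12`, which beats quitting for 10.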
Q Value
With a given state `s` and action `a`, compute the Q value as follows (a small Python helper follows the list):
- for all possible new states `new_state`
  - take the probability `T(state, action, new_state)` as `p`
  - take the reward `reward(state, action, new_state)` as `r`
  - compute `item = p * (r + discount * v)`, where `v` is the value of `new_state`
- return the sum of all `item`
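
The same computation as a small Python helper, written against the toy interface sketched earlier; passing `v` and `discount` explicitly is an assumption of this sketch.

```python
def Q(s, a, states, T, reward, v, discount):
    # Expected value of taking action a in state s, given the current value
    # estimates v: sum over new states of p * (r + discount * v[new_state]).
    return sum(
        T(s, a, new_state) * (reward(s, a, new_state) + discount * v[new_state])
        for new_state in states
    )
```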
Policy Evaluation
A policy is a function that maps a state to an action, specifying which action should be applied in a given state.
Use policy evaluation to determine how good a policy `pi` is, e.g., whether it outperforms other policies.
- initialize `v` of all states `s` to 0
- for iteration `t` in `1..` (until convergence or max iteration depth)
  - for each state `s`
    - the action `a` comes from the evaluated policy `pi`
    - update `v` of `s` with `Q(s, a)`
This is just value iteration with the set of possible actions at each state cut down to the single action generated by the evaluated policy `pi`, as in the sketch below.
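
A sketch in the same style, reusing the `Q` helper from the previous section; the only change from value iteration is that each state backs up the single action `pi[s]` instead of a max over all actions.

```python
def policy_evaluation(pi, states, T, reward, is_end, discount,
                      max_iters=100, tol=1e-6):
    # Same backup as value iteration, but every state uses only the single
    # action pi[s] chosen by the policy instead of a max over all actions.
    v = {s: 0.0 for s in states}
    for _ in range(max_iters):
        new_v = {
            s: 0.0 if is_end(s) else Q(s, pi[s], states, T, reward, v, discount)
            for s in states
        }
        if max(abs(new_v[s] - v[s]) for s in states) < tol:
            return new_v  # converged
        v = new_v
    return v
```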
Policy Iteration
Policy iteration repeatedly evaluates and updates the policy to find the best one, i.e., a policy whose actions are all optimal and achieve the highest `v` values; such a policy is itself optimal. A Python sketch follows the steps below.
- randomly initialize policy `pi`
- for iteration `t` in `1..`, until convergence (`pi` does not change) or max iteration depth
  - get `v` of all states by policy evaluation
  - improve the policy
    - for `s` in all states
      - for `a` in all valid actions
        - `q = Q(s, a)`
      - get the maximum `q` as `max_q`
      - update the policy at `s` to the action `a` that achieves `max_q`
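
Putting the pieces together, a sketch that reuses `policy_evaluation` and `Q` from the previous sections; the random initialization and the `max_iters` cap are assumptions of this sketch.

```python
import random

def policy_iteration(states, actions, T, reward, is_end, discount,
                     max_iters=100):
    # start from a random valid action in every non-terminal state
    pi = {s: random.choice(actions(s)) for s in states if actions(s)}
    for _ in range(max_iters):
        # evaluate: value of every state under the current policy
        v = policy_evaluation(pi, states, T, reward, is_end, discount)
        # improve: in every state pick the action with the highest Q value
        new_pi = {
            s: max(actions(s),
                   key=lambda a: Q(s, a, states, T, reward, v, discount))
            for s in pi
        }
        if new_pi == pi:  # the policy stopped changing: converged
            return new_pi
        pi = new_pi
    return pi
```

On the toy MDP this should return `{'in': 'stay'}` within two improvement steps, consistent with the value of 12 found by value iteration.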