An MDP Take-away
Last updated on August 13, 2024
A Markov decision process take-away.
Problem Formulation
- `states`: a set of all possible states
- `start_state`: the starting state
- `actions(state)`: given a state, return a list of all valid actions
- `T(state, action, new_state)`: return the probability of reaching `new_state` after taking `action` at `state`
- `reward(state, action, new_state)`: return the reward for the given transition
- `is_end(state)`: check whether the state is a terminal state
- `gamma`/`discount`: discount factor for future rewards, between 0 and 1 inclusive (a toy Python example follows the list)
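
As a concrete illustration, here is a minimal Python sketch of this interface for a hypothetical two-state "stay or quit" game; the state names, rewards, and transition probabilities are made up for the example.

```python
# A toy MDP implementing the interface above: from state "in" you can
# "stay" (reward 4, 2/3 chance of remaining "in") or "quit" (reward 10,
# the game always ends). All numbers here are illustrative.
states = ["in", "end"]
start_state = "in"
discount = 1.0  # gamma

def actions(state):
    # all valid actions for the given state
    return ["stay", "quit"] if state == "in" else []

def T(state, action, new_state):
    # probability of reaching new_state after taking action at state
    if state == "in" and action == "stay":
        return {"in": 2 / 3, "end": 1 / 3}.get(new_state, 0.0)
    if state == "in" and action == "quit":
        return 1.0 if new_state == "end" else 0.0
    return 0.0

def reward(state, action, new_state):
    # reward for the transition (state, action) -> new_state
    if state == "in" and action == "stay":
        return 4
    if state == "in" and action == "quit":
        return 10
    return 0

def is_end(state):
    # terminal-state check
    return state == "end"
```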
Value Iteration
Value iteration converges quite fast. A minimal Python sketch follows the steps below.
- initialize `v` of all states `s` to 0
- for iteration `t` in `1..` (until convergence or max iteration depth)
  - for each state `s`
    - for `a` in `actions(s)`
      - get the `q` value `Q(s, a)`
    - take the maximum of all `q` values as `max_q`
    - update `v` of `s` with `max_q`
- return all `v`
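
A minimal sketch of these steps in Python, assuming the toy `states`, `actions`, `T`, `reward`, `is_end`, and `discount` defined earlier; the `max_iters` and `tol` parameters are just illustrative stopping criteria.

```python
def value_iteration(states, actions, T, reward, is_end, discount,
                    max_iters=100, tol=1e-6):
    # v[s] starts at 0 for every state
    v = {s: 0.0 for s in states}
    for _ in range(max_iters):
        new_v = {}
        for s in states:
            if is_end(s) or not actions(s):
                new_v[s] = 0.0  # terminal states keep value 0
                continue
            # take the maximum Q value over all valid actions, where
            # Q(s, a) = sum over new states of T * (reward + discount * v)
            new_v[s] = max(
                sum(T(s, a, sp) * (reward(s, a, sp) + discount * v[sp])
                    for sp in states)
                for a in actions(s)
            )
        # stop once the values stop changing (convergence)
        if max(abs(new_v[s] - v[s]) for s in states) < tol:
            return new_v
        v = new_v
    return v
```

On the toy MDP, `value_iteration(states, actions, T, reward, is_end, discount)` should converge to roughly `{'in': 12.0, 'end': 0.0}`: staying forever is worth `4 + (2/3) * 12 = 12`, which beats quitting for 10.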
Q Value
With a given state `s` and action `a`, compute the Q value as follows (a small Python helper follows the list):
- for all possible new states `new_state`
  - take the probability `T(state, action, new_state)` as `p`
  - take the reward `reward(state, action, new_state)` as `r`
  - compute `item = p * (r + discount * v)`, where `v` is the value of `new_state`
- return the sum of all `item`
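
The same computation as a small Python helper, written against the toy interface sketched earlier; passing `v` and `discount` explicitly is an assumption of this sketch.

```python
def Q(s, a, states, T, reward, v, discount):
    # Expected value of taking action a in state s, given the current value
    # estimates v: sum over new states of p * (r + discount * v[new_state]).
    return sum(
        T(s, a, new_state) * (reward(s, a, new_state) + discount * v[new_state])
        for new_state in states
    )
```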
Policy Evaluation
A policy is a function that maps a state to an action, specifying which action should be applied in a given state.
Use policy evaluation to determine how good a policy `pi` is, e.g., whether it outperforms other policies.
- initialize `v` of all states `s` to 0
- for iteration `t` in `1..` (until convergence or max iteration depth)
  - for each state `s`
    - the action `a` comes from the evaluated policy `pi`
    - update `v` of `s` with `Q(s, a)`
This is just value iteration with the set of possible actions at each state cut down to the single action generated by the evaluated policy `pi`, as in the sketch below.
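
A sketch in the same style, reusing the `Q` helper from the previous section; the only change from value iteration is that each state backs up the single action `pi[s]` instead of a max over all actions.

```python
def policy_evaluation(pi, states, T, reward, is_end, discount,
                      max_iters=100, tol=1e-6):
    # Same backup as value iteration, but every state uses only the single
    # action pi[s] chosen by the policy instead of a max over all actions.
    v = {s: 0.0 for s in states}
    for _ in range(max_iters):
        new_v = {
            s: 0.0 if is_end(s) else Q(s, pi[s], states, T, reward, v, discount)
            for s in states
        }
        if max(abs(new_v[s] - v[s]) for s in states) < tol:
            return new_v  # converged
        v = new_v
    return v
```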
Policy Iteration
Policy iteration repeatedly evaluates and updates the policy to find the best one, i.e., a policy whose actions are all optimal and achieve the highest `v` values; such a policy is itself optimal. A Python sketch follows the steps below.
- randomly initialize policy `pi`
- for iteration `t` in `1..`, until convergence (`pi` does not change) or max iteration depth
  - get `v` of all states by policy evaluation
  - improve the policy
    - for `s` in all states
      - for `a` in all valid actions
        - `q = Q(s, a)`
      - get the maximum `q` as `max_q`
      - update the policy at `s` to the action `a` that achieves `max_q`
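
Putting the pieces together, a sketch that reuses `policy_evaluation` and `Q` from the previous sections; the random initialization and the `max_iters` cap are assumptions of this sketch.

```python
import random

def policy_iteration(states, actions, T, reward, is_end, discount,
                     max_iters=100):
    # start from a random valid action in every non-terminal state
    pi = {s: random.choice(actions(s)) for s in states if actions(s)}
    for _ in range(max_iters):
        # evaluate: value of every state under the current policy
        v = policy_evaluation(pi, states, T, reward, is_end, discount)
        # improve: in every state pick the action with the highest Q value
        new_pi = {
            s: max(actions(s),
                   key=lambda a: Q(s, a, states, T, reward, v, discount))
            for s in pi
        }
        if new_pi == pi:  # the policy stopped changing: converged
            return new_pi
        pi = new_pi
    return pi
```

On the toy MDP this should return `{'in': 'stay'}` within two improvement steps, consistent with the value of 12 found by value iteration.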