6 March, 13:18

Show how an MDP with a reward function R (s, a, s') can be transformed into a different MDP with reward function R (s, a), such that optimal policies in the new MDP correspond exactly to optimal policies in the original MDP.

Answers (1)
  1. 6 March, 13:38

    Explanation:

    MDPs

    MDPs can be formulated with a reward function R(s), with a reward function R(s, a) that also depends on the action taken, or with a reward function R(s, a, s') that depends on the action taken and on the outcome state.
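
    For reference, the Bellman equation takes a slightly different form under each of these conventions (standard forms, stated here only to make the distinction concrete):

    R(s):        U(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') U(s')
    R(s, a):     U(s) = max_a [ R(s, a) + γ Σ_{s'} T(s, a, s') U(s') ]
    R(s, a, s'): U(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ U(s') ]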

    To show how an MDP with a reward function R(s, a, s') can be transformed into a different MDP with a reward function R(s, a), such that optimal policies in the new MDP correspond exactly to optimal policies in the original MDP:

    One solution is to define a 'pre-state' pre(s, a, s') for every s, a, s', such that executing a in s leads not to s' but to pre(s, a, s'). From the pre-state there is only one action b, which always leads to s'. Let the new MDP have transition model T', reward R', and discount γ':

    T'(s, a, pre(s, a, s')) = T(s, a, s')
    T'(pre(s, a, s'), b, s') = 1
    R'(s, a) = 0
    R'(pre(s, a, s'), b) = γ^{-1/2} R(s, a, s')
    γ' = γ^{1/2}

    Then, using pre as shorthand for pre(s, a, s'):

    U(s) = max_a [ R'(s, a) + γ^{1/2} Σ_{pre} T'(s, a, pre) ( max_b [ R'(pre, b) + γ^{1/2} Σ_{s'} T'(pre, b, s') U(s') ] ) ]

    Substituting the definitions of T', R', and γ' gives

    U(s) = max_a [ Σ_{s'} T(s, a, s') ( R(s, a, s') + γ U(s') ) ]

    which is exactly the Bellman equation of the original MDP, so optimal policies in the new MDP (restricted to the original states) correspond exactly to optimal policies in the original MDP.
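
    The construction above is easy to check numerically. Below is a minimal sketch in Python, assuming a small dictionary-based MDP representation; the names (value_iteration, make_pre_state_mdp) and the two-state example are made up purely for illustration. It builds the transformed MDP and verifies that value iteration gives the same utilities, and hence the same optimal policies, on the original states.

import math

def value_iteration(states, actions, T, R, gamma, eps=1e-10):
    """Value iteration for the R(s, a, s') convention:
    U(s) = max_a sum_{s'} T(s, a, s') (R(s, a, s') + gamma U(s'))."""
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(p * (R(s, a, s2) + gamma * U[s2]) for s2, p in T[(s, a)])
                for a in actions(s))
            delta = max(delta, abs(best - U[s]))
            U[s] = best
        if delta < eps:
            return U

def make_pre_state_mdp(states, actions, T, R3, gamma):
    """Pre-state construction: the new reward depends only on (s, a)."""
    new_states = list(states)
    new_T, new_R2 = {}, {}
    for s in states:
        for a in actions(s):
            new_T[(s, a)] = []
            new_R2[(s, a)] = 0.0                               # R'(s, a) = 0
            for s2, p in T[(s, a)]:
                pre = ('pre', s, a, s2)
                new_states.append(pre)
                new_T[(s, a)].append((pre, p))                 # T'(s, a, pre(s, a, s')) = T(s, a, s')
                new_T[(pre, 'b')] = [(s2, 1.0)]                # T'(pre(s, a, s'), b, s') = 1
                new_R2[(pre, 'b')] = R3(s, a, s2) / math.sqrt(gamma)  # R'(pre, b) = gamma^(-1/2) R(s, a, s')
    def new_actions(s):
        return ['b'] if isinstance(s, tuple) and s[0] == 'pre' else actions(s)
    return new_states, new_actions, new_T, new_R2, math.sqrt(gamma)   # gamma' = gamma^(1/2)

# Tiny made-up example with an R(s, a, s') reward.
states = ['s0', 's1']
actions = lambda s: ['stay', 'go']
T = {('s0', 'stay'): [('s0', 1.0)], ('s0', 'go'): [('s1', 0.9), ('s0', 0.1)],
     ('s1', 'stay'): [('s1', 1.0)], ('s1', 'go'): [('s0', 0.9), ('s1', 0.1)]}
R3 = lambda s, a, s2: 1.0 if s2 == 's1' else 0.0
gamma = 0.9

U_orig = value_iteration(states, actions, T, R3, gamma)
ns, na, nT, nR2, ng = make_pre_state_mdp(states, actions, T, R3, gamma)
# In the new MDP the reward is R'(s, a); wrap it so the same solver can be reused.
U_new = value_iteration(ns, na, nT, lambda s, a, s2: nR2[(s, a)], ng)
for s in states:
    assert abs(U_orig[s] - U_new[s]) < 1e-6   # utilities of the original states agree

    The γ^{-1/2} factor on R' and the γ^{1/2} discount cancel over the two-step detour through the pre-state, which is why the utilities of the original states come out unchanged.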

    Now do the same to convert MDPs with R (s, a) into MDPs with R (s).

    Similarly, create a post-state post(s, a) for every s, a, from which only a single action b is available, such that

    T'(s, a, post(s, a)) = 1
    T'(post(s, a), b, s') = T(s, a, s')
    R'(s) = 0
    R'(post(s, a)) = γ^{-1/2} R(s, a)
    γ' = γ^{1/2}

    Then, using post as shorthand for post(s, a):

    U(s) = R'(s) + γ^{1/2} max_a [ Σ_{post} T'(s, a, post) ( R'(post) + γ^{1/2} max_b [ Σ_{s'} T'(post, b, s') U(s') ] ) ]

    Substituting the definitions of T', R', and γ' gives

    U(s) = max_a [ R(s, a) + γ Σ_{s'} T(s, a, s') U(s') ]

    which is the Bellman equation for the original R(s, a) MDP, so again the optimal policies correspond exactly.
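
    The post-state construction admits the same kind of numerical check, under the same assumed dictionary representation (again, all names and the toy numbers are purely illustrative). After value iteration on the transformed MDP, the utilities of the original states satisfy the R(s, a) Bellman equation derived above.

import math

def value_iteration_state_reward(states, actions, T, R, gamma, eps=1e-10):
    """Value iteration for the R(s) convention:
    U(s) = R(s) + gamma * max_a sum_{s'} T(s, a, s') U(s')."""
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = R[s] + gamma * max(
                sum(p * U[s2] for s2, p in T[(s, a)]) for a in actions(s))
            delta = max(delta, abs(best - U[s]))
            U[s] = best
        if delta < eps:
            return U

def make_post_state_mdp(states, actions, T, R2, gamma):
    """Post-state construction: the new reward depends only on the state."""
    new_states = list(states)
    new_T = {}
    new_R1 = {s: 0.0 for s in states}                  # R'(s) = 0
    for s in states:
        for a in actions(s):
            post = ('post', s, a)
            new_states.append(post)
            new_T[(s, a)] = [(post, 1.0)]              # T'(s, a, post(s, a)) = 1
            new_T[(post, 'b')] = list(T[(s, a)])       # T'(post(s, a), b, s') = T(s, a, s')
            new_R1[post] = R2[(s, a)] / math.sqrt(gamma)   # R'(post(s, a)) = gamma^(-1/2) R(s, a)
    def new_actions(s):
        return ['b'] if isinstance(s, tuple) and s[0] == 'post' else actions(s)
    return new_states, new_actions, new_T, new_R1, math.sqrt(gamma)   # gamma' = gamma^(1/2)

# Tiny made-up example with an R(s, a) reward.
states = ['s0', 's1']
actions = lambda s: ['stay', 'go']
T = {('s0', 'stay'): [('s0', 1.0)], ('s0', 'go'): [('s1', 0.9), ('s0', 0.1)],
     ('s1', 'stay'): [('s1', 1.0)], ('s1', 'go'): [('s0', 0.9), ('s1', 0.1)]}
R2 = {('s0', 'stay'): 0.0, ('s0', 'go'): 0.5, ('s1', 'stay'): 1.0, ('s1', 'go'): 0.2}
gamma = 0.9

ns, na, nT, nR1, ng = make_post_state_mdp(states, actions, T, R2, gamma)
U = value_iteration_state_reward(ns, na, nT, nR1, ng)
# On the original states, U satisfies the R(s, a) Bellman equation of the original MDP.
for s in states:
    rhs = max(R2[(s, a)] + gamma * sum(p * U[s2] for s2, p in T[(s, a)]) for a in actions(s))
    assert abs(U[s] - rhs) < 1e-6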
