Brunskill - Lecture 2

Lecture Link

Assumption: the dynamics model and the reward model (which together describe how the world works) are given, i.e., known to the agent.

Markov Reward Process, Continued

Markov property

In the lecture, the following is shown:

$$
V(s) = R(s) + \gamma \cdot \sum_{s'} Pr(s' | s) V(s')
$$

which is a result of the Markov assumption. Here we provide a detailed derivation. Recall that $G_t$, $s_t$, $r_t$ are all random variables.

$$
\begin{align*}
V(s) &= \mathbb{E}[G_t | s_t = s] \\
&= \mathbb{E}[r_t + \gamma \cdot G_{t+1} | s_t = s] \\
&= \mathbb{E}[r_t | s_t = s] + \gamma \cdot \mathbb{E}[G_{t+1} | s_t = s]
\end{align*}
$$

Here, $R(s)$ is defined as $\mathbb{E}[r_t | s_t = s]$ (in the previous lecture), hence the first term. The second term $\mathbb{E}[G_{t+1} | s_t = s]$ can be further expressed as:

$$
\mathbb{E}[G_{t+1} | s_t = s] = \mathbb{E}[\; \mathbb{E}[G_{t+1} | s_{t+1}] \; | s_t = s]
$$

The Markov property is used here to obtain the equation above. In general, $\mathbb{E}[X | Z = z] \neq \mathbb{E}[\; \mathbb{E}[X | Y] \; | Z = z]$; the equality holds when $X$ is conditionally independent of $Z$ given $Y$. This is indeed true in our case: the return at $t+1$ is independent of the state at $t$ given the state at $t+1$, due to the Markov assumption. It then follows that:

$$
\begin{align*}
\mathbb{E}[\; \mathbb{E}[G_{t+1} | s_{t+1}] \; | s_t = s] &= \sum_{s'} \mathbb{E}[G_{t+1} | s_{t+1} = s'] \cdot Pr(s_{t+1} = s' | s_t = s) \\
&= \sum_{s'} V(s') \cdot Pr(s_{t+1} = s' | s_t = s)
\end{align*}
$$
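As a sanity check on this identity, here is a minimal sketch (using a small, made-up 3-state MRP with hypothetical `P`, `R`, and $\gamma = 0.9$, not taken from the lecture) that estimates $V(s) = \mathbb{E}[G_t | s_t = s]$ by Monte Carlo rollouts and compares it against the Bellman right-hand side $R(s) + \gamma \sum_{s'} Pr(s' | s) V(s')$:

```python
import numpy as np

# Hypothetical 3-state MRP (illustrative numbers only, not from the lecture).
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.2, 0.8]])   # P[s, s'] = Pr(s_{t+1} = s' | s_t = s)
R = np.array([1.0, 0.0, -1.0])    # R(s) = E[r_t | s_t = s]
gamma = 0.9
rng = np.random.default_rng(0)

def mc_value(s, n_episodes=5000, horizon=100):
    """Monte Carlo estimate of V(s) = E[G_t | s_t = s] (truncated horizon)."""
    total = 0.0
    for _ in range(n_episodes):
        state, discount, G = s, 1.0, 0.0
        for _ in range(horizon):
            G += discount * R[state]
            discount *= gamma
            state = rng.choice(3, p=P[state])
        total += G
    return total / n_episodes

V_mc = np.array([mc_value(s) for s in range(3)])
print(V_mc)                  # empirical estimate of V(s)
print(R + gamma * P @ V_mc)  # Bellman right-hand side; should roughly agree
```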

Computing the values

Given $V(s) = R(s) + \gamma \cdot \sum_{s'} Pr(s' | s) V(s')$, if we know the values of $R(s)$ for all states $s$ and the transition matrix $P$, then one can solve for the value function in closed form:

$$
V = (I - \gamma P)^{-1} R
$$

where $(I - \gamma P)$ is invertible since $P$ is a stochastic matrix and $\gamma < 1$.
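A minimal sketch of this direct solve, reusing the same hypothetical 3-state MRP as above (made-up numbers; the linear system is solved rather than forming the inverse explicitly):

```python
import numpy as np

# Same hypothetical 3-state MRP (illustrative numbers only).
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.2, 0.8]])
R = np.array([1.0, 0.0, -1.0])
gamma = 0.9

# Solve (I - gamma * P) V = R instead of explicitly inverting the matrix.
V = np.linalg.solve(np.eye(3) - gamma * P, R)
print(V)
```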

One can also use an iterative method: (i) initialize $V_0(s)$ (say, to $0$) for all states $s$, and then (ii) iteratively update $V_k(s) = R(s) + \gamma \cdot \sum_{s'} Pr(s' | s) V_{k-1}(s')$ for $k = 1, 2, \ldots$ until convergence.
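A sketch of the iterative approach on the same hypothetical MRP; it should converge to the same values as the direct solve above:

```python
import numpy as np

# Same hypothetical 3-state MRP (illustrative numbers only).
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.2, 0.8]])
R = np.array([1.0, 0.0, -1.0])
gamma = 0.9

V = np.zeros(3)                    # V_0(s) = 0 for all s
for k in range(1, 10_000):
    V_next = R + gamma * P @ V     # V_k = R + gamma * P * V_{k-1}
    if np.max(np.abs(V_next - V)) < 1e-10:
        break
    V = V_next
print(V_next)                      # matches the closed-form solution above
```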