Lecture Link
Assumption: The dynamics model and the reward model (which together describe how the world works) are given to the agent.
In the lecture, the following is shown:
$$V(s) = R(s) + \gamma \cdot \sum_{s'} \Pr(s' \mid s) \, V(s'),$$

which is a result of the Markov assumption. Here we provide a detailed derivation. Recall that $G_t$, $s_t$, $r_t$ are all random variables.
$$V(s) = \mathbb{E}[G_t \mid s_t = s] = \mathbb{E}[r_t + \gamma \cdot G_{t+1} \mid s_t = s] = \mathbb{E}[r_t \mid s_t = s] + \gamma \cdot \mathbb{E}[G_{t+1} \mid s_t = s]$$

Here, $R(s)$ is defined as $\mathbb{E}[r_t \mid s_t = s]$ (in the previous lecture), hence the first term. The second term, $\mathbb{E}[G_{t+1} \mid s_t = s]$, can be further expressed as:
$$\mathbb{E}[G_{t+1} \mid s_t = s] = \mathbb{E}\big[\, \mathbb{E}[G_{t+1} \mid s_{t+1}] \,\big|\, s_t = s \big]$$

The Markov property is used here to obtain the equation above. In general, $\mathbb{E}[X \mid Z = z] = \mathbb{E}\big[\mathbb{E}[X \mid Y] \,\big|\, Z = z\big]$ holds when $X$ is conditionally independent of $Z$ given $Y$. This is indeed true in our case: the return at $t+1$ is independent of the state at $t$ given the state at $t+1$, due to the Markov assumption. It then follows that:
$$\mathbb{E}\big[\, \mathbb{E}[G_{t+1} \mid s_{t+1}] \,\big|\, s_t = s \big] = \sum_{s'} \mathbb{E}[G_{t+1} \mid s_{t+1} = s'] \cdot \Pr(s_{t+1} = s' \mid s_t = s) = \sum_{s'} V(s') \cdot \Pr(s_{t+1} = s' \mid s_t = s)$$

Given $V(s) = R(s) + \gamma \cdot \sum_{s'} \Pr(s' \mid s) V(s')$, if we know the values of $R(s)$ for all states $s$ and the transition matrix $P$, we can solve for the value function in closed form:
$$V = (I - \gamma P)^{-1} R,$$

where $(I - \gamma P)$ is invertible since $P$ is a stochastic matrix and $\gamma < 1$.
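As a minimal sketch of the closed-form solve (the three-state transition matrix, reward vector, and discount factor below are made up purely for illustration and are not from the lecture):

```python
import numpy as np

# Hypothetical 3-state Markov reward process:
# P[i, j] = Pr(s' = j | s = i), R[i] = expected immediate reward in state i.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])
R = np.array([1.0, 0.0, -1.0])
gamma = 0.9

# Closed form: V = (I - gamma * P)^{-1} R.
# Solving the linear system is preferred over forming the inverse explicitly.
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V)

# Sanity check: V satisfies the Bellman equation V = R + gamma * P @ V.
assert np.allclose(V, R + gamma * P @ V)
```

Note that the exact solve costs roughly $O(n^3)$ in the number of states, which motivates the iterative alternative below for large state spaces.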
One can also use an iterative method: (i) initialize $V_0(s)$ (say, to $0$) for all states $s$, and then (ii) iteratively update $V_k(s) = R(s) + \gamma \cdot \sum_{s'} \Pr(s' \mid s) V_{k-1}(s')$ for $k = 1, 2, \ldots$ until convergence.
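A sketch of this iterative evaluation, under the same made-up toy MRP as above; the function name `evaluate_iteratively` and the tolerance-based stopping rule (`tol`, `max_iters`) are illustrative choices, not from the lecture:

```python
import numpy as np

def evaluate_iteratively(P, R, gamma, tol=1e-8, max_iters=10_000):
    """Iterate V_k = R + gamma * P @ V_{k-1}, starting from V_0 = 0."""
    V = np.zeros_like(R, dtype=float)
    for _ in range(max_iters):
        V_new = R + gamma * P @ V
        if np.max(np.abs(V_new - V)) < tol:  # stop once the update is negligible
            return V_new
        V = V_new
    return V

# Same illustrative 3-state MRP as in the previous snippet.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])
R = np.array([1.0, 0.0, -1.0])
gamma = 0.9

V_iter = evaluate_iteratively(P, R, gamma)
V_exact = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
assert np.allclose(V_iter, V_exact, atol=1e-6)
```

Since the update is a $\gamma$-contraction, the iterates converge to the unique fixed point, which is the same $V$ given by the closed-form solution.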