The reward function has to be given/predefined and it is fixed. I believe that this task was designend with the implicit Reward Function:

[m]R(s) = 100 if A has won in s[/m]

[m]R(s) = 0 else[/m]

You could argue that this is not explicitly stated in the task, but it is also the most probable / the easiest reward function you could imagine.

Because in the first real iteration, you only look at the initial U(s) in the Bellman equation. As you can see, in the same iteration [m]U’(A_1, B_4)[/m] is set to 0 since it leads to B winning after one look ahead, but for [m]U’(A_2,B_3)[/m] you have to use [m]U(A_1, B_4)[/m], not [m]U’(A_1, B_4)[/m].

Just imagine you would use [m]U’[/m] instead of [m]U[/m], then the order of computation will be crucial simply: nothing will work.

It similar to pagerank algorithm, where you only calculate the new pageranks based on the old ones.