Disclaimer: Dieser Thread wurde aus dem alten Forum importiert. Daher werden eventuell nicht alle Formatierungen richtig angezeigt. Der ursprüngliche Thread beginnt im zweiten Post dieses Threads.

**Problem 7.1.5**

Es wird in der Aufgabenstellung gesagt, dass nur die States relevant sind, in denen A am Zug ist.

Am Zug sein bedeutet ja, dass das Spiel noch nicht vorbei ist (sonst ist ja keiner am Zug).

Jetzt habe ich zwei Probleme:

- Ich habe einen State s, in dem A am Zug ist und entweder gewinnen kann oder einen anderen Zug machen kann, der nicht direkt gewinnt. Hier habe ich angenommen, dass R(s) = 100. Irgendwo muss ich diese Zahl ja einbauen und da jeder State in dem A schon gewonnen hat, eigentlich ja gar keiner ist, dachte ich das muss hier sein. Ist halt aber schon sehr hingecheatet. Kann man das so machen?
- Ich habe einen State s, von dem aus A eine Aktion ausführen kann, woraufhin B entweder gewinnen kann (Fall 1), oder in einen State s’ übergehen kann, von dem aus das Spiel weitergeht (Fall 2). In Fall 1 könnte man es zwar so modellieren, dass wir einen State haben. Aber selbst dann muss ich das Maximum über eine leere Menge ziehen (es gibt keine Aktionen mehr) und das macht ja keinen Sinn. Wenn das kein State ist, habe ich allerdings keine Utility, die ich für die Berechnung in s verwenden kann. Was tue ich hier? Ich muss das ganze ja auch noch mit Fall 2 verrechnen, wenn ich Fall 1 einfach weglasse, “merkt” der Algorithmus ja nicht, dass s ein schlechter Zustand ist.

Danke :huh:

I’m not sure if I got this correctly but one “round” [m]T(s, a, s’)[/m] is defined as the probability of [m]A[/m] moving action [m]a[/m] followed by [m]B[/m]‘s move. After both moves we are in state [m]s’[/m] and can update the utility because it’s [m]A[/m]'s move again. In my opinion, this is what’s meant with “game states, where it’s [m]A[/m]'s move”. In addition, reward should be given only in finite states such as (4,3) or (3,1) and zero else. But I’m not a hundred percent sure about this. Hope this helps - otherwise you should wait for @Jazzpirate’s answer

This *specifically includes terminal states* of course! „A won“ and „B won“ are terminal states. Sorry if that wasn’t clear.

No, this is obviously not allowed. If A hasn’t won yet, they don’t get their reward.

Or in other words:

Exactly, thank you

Can we assume reward as zero in non-terminal states?

Damn, yes, that I definitely should have mentioned!

For the drawing exercice: I assume we only need to do reachable states and once a terminal state was reached no more actions are performed, correct?

For 5. I’m not really sure about the wording. Can I set gamma (e.g. to 0.99) or do I need to do this for a “general” gamma or should I do it for different gammas possibly leading to different strategies?

Correct

The second one. Just take an undetermined variable \gamma - ultimately it doesn’t matter; the strategy should be the same in all cases, which is why we can just leave gamma unspecified.

By now I still have this problem. I get that (3,1) should be a terminal state but there are no actions, so what is the maximum over all possible actions? An empty set doesn’t have a maximum. If I assume that the result is zero, what practically makes sense because we have only non-negative values, I get the problem that I have to divide by zero („Now assume that your opponent picks a move with probability inversely proportional to our utility of the resulting state“).

My idea was now to introduce an epsilon greater zero that should be the maximum of an empty set. Is there any better way to do that?

What I did was assuming self-loops in terminal states and calculating the utility with [m]s’ = s[/m] which works fine for my defined Bellman equation. Finally, that leads to multiplying with zero and therefore having [m]U(s) = R(s)[/m] for all [m]s ∈ F[/m].

I already did that but my problem are the parents of terminal states. My probability T in the Bellman-Equation contains 1/U(s) for every successor state. I don’t see how to prevent this due to the inversely proportionality and so I have to divide by zero. That’s why I wanted to introduce an epsilon.

Maybe you should assume T as 1 if there is only one reachable state.

Yes, it does - the maximum over an order L of the empty set is the minimum of L. Hence the maximum is 0; which nicely amounts to „the utility of a terminal node is exactly its reward“ - which makes sense if you think about it

Hmm, this doesn’t happen to me, but then again:

This doesn’t happen for me either; mine only contains 1/(the sum of the utilities for all successor states)

Note that by saying “probability inversely proportional to our utility of the resulting state” I don’t mean that there is a fixed proportionality factor between all utilities and their probabilities. What I meant to say is simply “the higher the utility, the lesser the probability”.

In either case; you may safely assume that the probability of a set of states {s_1,…s_n} of which U(s_1)=…=U(s_k)=0 is 1/k for each s_i up until k and 0 for all others. i.e. the opponent always picks a state with Utility 0, and if there’s more than one all of those with equal probability. You’ll note that this satisfies a) the proportionality requirement under b) the additional assumption that 0 is as bad as it gets and is infinitely worse than every other utility.

Kinda hacky, but I still assume that there’s an “error” in your Bellman-equation somehow, I understand if you don’t want to throw everything away or spend hours looking for how your Bellman-equation might be wrong, and this should get you unstuck without violating the spirit of the exercise.

DAMN, but I just noticed that I’M AN IDIOT.

This is incredibly misleading; since this is only the case if s1 and s2 are the only successors >.<

Sorry, I thought the way I put it would clearly imply one unique way to go about doing things, which it clearly did not. Even more so, I accidentally very much suggested a different way to model this than I intended. See previous comment for how to continue if you encounter a division by zero.

Could you please give another valid example, that makes clear which function you intended? If that is feasible consider the utilities 30, 20, 10 for 3 successor states.

What I’d say (not validated yet): what you have to do is calculating the relation between the utilities and inverse them.

That means, if:

- [m]U(s_1) = 10[/m],
- [m]U(s_2) = 2 * U(s_1) = 20[/m] and
- [m]U(s_3) = 3 * U(s_1) = 30[/m] for player [m]A[/m]

then for player [m]B[/m] it is twice as probable to prefer [m]s_1[/m] to [m]s_2[/m] and three times more probable to prefer [m]s_1[/m] to [m]s_3[/m]. Of course [m]B[/m] is also 1.5 times as likely to pick [m]s_2[/m] than [m]s_3[/m]. Hope that hint could help you.

Actually, what I intended was simply $P(s)=1-(U(s)/SUM(U(s’)))$ (where s’ ranges over all successors), but any probability assignment where the enemy is less likely to pick a successor with high utility is fine

The solution from assignments.pdf says “we initialize wit [m]U(s) = 50[/m] for all states” but then during the update of state [m](A_2, B_3)[/m] (second bullet) it seems that the current reward [m]R(s)[/m] for this state is zero where I excepted to use 50. So specifically what’s the difference between utility and reward function?

Edit: And why is B’s probability of moving right to [m](A_1, B_4)[/m] the same as moving left to [m](A_1, B_2)[/m], where he’d won instantly?