Agent Strategy Thread

Disclaimer: This thread was imported from the old forum, so some formatting may not be displayed correctly. The original thread begins with the second post of this thread.

Agent Strategy Thread
Stein3000 by Tobias Völk (tutor)

AlphaBeta, IDS, move sorting

Heuristic: “checkmate” if more than half of all seeds are in a store or the board is empty, else store difference
(Last year: evaluation functions trained with a genetic algorithm for each board size. While superior on average, they made some very dumb mistakes → removed just to be safe.)
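
A rough sketch of what such a heuristic boils down to (not the actual Stein3000 code; the parameters are placeholders for whatever the real board representation provides):

```java
final class Heuristic {
    // "Checkmate" score: large enough to dwarf any possible store difference.
    static final int WIN = 1_000_000;

    static int evaluate(int myStore, int enemyStore, int totalSeeds, boolean boardEmpty) {
        int diff = myStore - enemyStore;
        if (2 * myStore > totalSeeds)    return  WIN + diff; // I already hold more than half of all seeds
        if (2 * enemyStore > totalSeeds) return -WIN + diff; // the opponent already holds more than half
        if (boardEmpty)                                      // board empty: the game is over, diff decides
            return Integer.signum(diff) * WIN + diff;
        return diff;                                         // otherwise: plain store difference
    }
}
```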

Quiescence search: only stops searching when there are no more captures
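
A minimal sketch of such a quiescence search (Board, Move and the helper methods used here are assumed placeholders; for brevity it also ignores Kalah’s extra-move rule and pretends the players strictly alternate):

```java
// Keeps expanding capture moves only; returns the static evaluation once the position is "quiet".
static int quiesce(Board board, int alpha, int beta) {
    int standPat = evaluate(board);           // "do nothing" baseline, e.g. the heuristic sketched above
    if (standPat >= beta) return beta;
    if (standPat > alpha) alpha = standPat;
    for (Move move : board.legalMoves()) {
        if (!board.isCapture(move)) continue; // only captures keep the search going
        int score = -quiesce(board.after(move), -beta, -alpha);
        if (score >= beta) return beta;
        if (score > alpha) alpha = score;
    }
    return alpha;
}
```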

Endgame databases for board sizes 4 to 12, covering all positions with fewer than a certain number of stones

Plays Kalah (4,4) perfectly with 0.5s per move on my old desktop.
Curiously only second place in the (4,4) round, probably because it wasn’t that good at tricking weaker agents into losing when playing north.
Apparently playing perfectly is not the best strategy ^^

EDIT: What the heck, I knew that the StoreDifference isn’t the best heuristic, but I didn’t expect my agent to fail that hard at board sizes 5 and 6 :-/

Excited to hear about your agents :slight_smile:


HALAK, Horrible Agent Losing At Kalah (source)

General Idea: Boring Alpha Beta Pruning via IDS with Transposition Tables, and plenty of premature optimization.

Didn’t perform well; I’m guessing I wrote a `<` where a `>` would have been the right thing (or vice versa).

The idea behind the transposition tables was to cache the evaluations of previous searches, to accelerate both further IDS iterations and the next rounds. HALAK-n would have maintained a table for the next n moves. To avoid a fatal `OutOfMemoryError`, a second thread uses Java’s `MemoryUsage` introspection to calculate how much of the heap is currently in use and trims the table if necessary. But because the fundamental search was broken, the transposition table didn’t help, so the final version was just “HALAK-0”.
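
For illustration, a sketch of that watchdog idea (the table type, threshold and trimming policy are made up, not HALAK’s real code; only the `MemoryUsage` introspection calls are the actual Java API):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.Map;

// Background thread that polls heap usage and trims the cached table when it gets too full.
final class TableTrimmer implements Runnable {
    private final Map<Long, Integer> table;  // assumed layout: position hash -> cached evaluation
                                             // (a real agent would use a concurrent map)
    private final double threshold;          // e.g. 0.8 = trim once 80% of the heap is in use

    TableTrimmer(Map<Long, Integer> table, double threshold) {
        this.table = table;
        this.threshold = threshold;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            long max = heap.getMax();        // may be -1 if the JVM reports no limit
            if (max > 0 && (double) heap.getUsed() / max > threshold) {
                table.clear();               // crude "trim": drop the whole cache
            }
            try {
                Thread.sleep(100);           // poll a few times per second
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```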

It was an absolute pain to write and test and my utter hatred for Java has just been intensified.


Angela

Starts with an excessive analysis of the current board and is able to return a really good next step within a couple of milliseconds.
As we didn’t know how much time the game would give us per move, we thought we might need this before starting the real algorithm.
This first search almost always wins on boards smaller than 6 against an ITD with depth <= 3.

After this we start an ITD at depth 3, pruning where possible and improving the first solution found.
The heuristic used is the score difference between the two players.
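
A rough sketch of that flow (firstSearch(), alphaBeta(), Board and Move are placeholders, not Angela’s actual code):

```java
// Quick first analysis as a fallback, then iterative deepening from depth 3 until the deadline.
static Move chooseMove(Board board, long deadlineMillis) {
    Move best = firstSearch(board);              // the fast initial analysis: an answer within milliseconds
    for (int depth = 3; System.currentTimeMillis() < deadlineMillis; depth++) {
        Move candidate = alphaBeta(board, depth, deadlineMillis); // ITD iteration with pruning
        if (candidate != null) best = candidate; // keep improving the first solution found
    }
    return best;
}
```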

Tried doing it in Scala first, but a lack of motivation to learn the language made us switch back to Java.
Overall we expected much bigger boards than 4 and very limited time.


AgentDoubleCheck.

Just alpha-beta search, looking first at double moves that don’t destroy capture moves, then at capture moves, then at other double moves. Prefer the house at the right end if nothing else seems good.
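
A sketch of that ordering as a comparator (Move and its helper methods are assumed placeholders, not the agent’s real API):

```java
import java.util.Comparator;
import java.util.List;

final class MoveOrdering {
    // Sorts moves so the most promising categories are searched first.
    static void order(List<Move> moves) {
        Comparator<Move> byCategory = Comparator.comparingInt(MoveOrdering::rank);
        moves.sort(byCategory.thenComparingInt(m -> -m.house())); // tiebreak: rightmost house first
    }

    private static int rank(Move m) {
        if (m.isDoubleMove() && !m.destroysCapture()) return 0; // double moves that keep captures intact
        if (m.isCapture())                            return 1; // then capture moves
        if (m.isDoubleMove())                         return 2; // then the remaining double moves
        return 3;                                               // everything else last
    }
}
```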

No opening book, no endgame database. No use of the init time to start calculations. No cheating on small boards :P.
Uses ITD to update the move variable.


UniqueNameForAnAgent

Uses magic. To be more precise: Single-threaded Alpha-Beta search with Iterative Deepening :smiley:

The evaluation function:

myStore * 9 + sumOfAllOfMyHouses - enemyStore * 9 - sumOfAllEnemyHouses
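
A literal translation of this formula into Java (sketch only; the array-based parameters are an assumption, not the agent’s actual representation):

```java
// Weighted store difference plus the difference of the seeds still in the houses.
static int evaluate(int[] myHouses, int myStore, int[] enemyHouses, int enemyStore) {
    int myHouseSum = 0, enemyHouseSum = 0;
    for (int seeds : myHouses)    myHouseSum    += seeds;
    for (int seeds : enemyHouses) enemyHouseSum += seeds;
    return myStore * 9 + myHouseSum - enemyStore * 9 - enemyHouseSum;
}
```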

It sorts the moves by the evaluation function (and looks at the “best” first).
I am not sure how the others did it, but we also had to build our own implementation of Kalah itself, so that we could try out moves.
It always uses the total time. It never stops until it gets killed.

How we implemented it:

(Each of the following Agents originally started as a copy of its predecessor and was then improved in some places.)

  1. We built a MiniMax Agent. This unfortunately did not even win reliably against the Random Agent. Reason: it would go down the very first descendant of the root node and get lost there (potentially because some moves can lead to a super long game, resulting in a pretty deep tree; not sure about this, though).

  2. MiniMaxWithIterativeDeepening. Reliably beat both the Random Agent and the MiniMax Agent.

Evaluation function:

myStore * 3 + sumOfAllOfMyHouses - enemyStore * 3 - sumOfAllEnemyHouses

(The “3” was just an intuition.)

  3. AlphaBeta. Reliably beat all of the previous Agents. Only sometimes lost in short games against MiniMaxWithIterativeDeepening, where the latter could look to the end of the game. (No way to fix this, since MiniMaxWithIterativeDeepening plays optimally in such cases.)

  4. AlphaBeta with Move Sorting. Reliably beat all Agents (except for some rare cases like short games).

  5. Now it was time to figure out whether the “3” from our evaluation function was good or not. So: let Agents with evaluation functions using “1”, “3”, “5”, “7” and “9” play against each other. It looked like “5” and “9” were the best options → let them play against each other → “9” won.

Possible improvements we could have made (if we had had more free time :D):

  • Figure out if “9” is really the best number. We haven’t checked larger ones.
  • Figure out a better overall heuristic
  • Make it multi-threaded?
  • Use the initialization time to do some computations.

Evolved Agent
Evolved Agent uses iterative deepening alpha-beta search. Additionally, it uses a HashSet in which previously explored states are saved. This helps specifically in the end stages of large boards, where many move sequences transitively lead to the same state.
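
A rough sketch of that duplicate check (Board, Move and the helpers are placeholders; the real agent combines this with its iterative-deepening alpha-beta search):

```java
import java.util.HashSet;
import java.util.Set;

// Skips child states whose encoding has already been explored in this search.
static int bestChildValue(Board board, int depth, Set<Long> explored) {
    int best = Integer.MIN_VALUE;
    for (Move move : board.legalMoves()) {
        Board child = board.after(move);
        if (!explored.add(child.encode())) continue;              // transposition: already expanded
        best = Math.max(best, value(child, depth - 1, explored)); // value() = the recursive search
    }
    return best;
}
```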

The heuristic is learned end-to-end via the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), which adapts the covariance matrix, standard deviation, and mean of a population of agents. I optimize the MSE between the predicted end result and the actual end result of rollouts generated by self-play, i.e. (predictedValue - actualValue)^2. Directly optimizing the policy via self-play could work, but has the disadvantage of not being able to use a replay buffer.

The reason why it performs so badly is (presumably) a bug in my transition simulation: instead of adding the captured value to the house, I set the house to that value. I only found this after the deadline :cry:

If I fix that behavior and retrain the policy, it would probably perform better, though I don’t know by how much: maybe evaluating the 2-layer MLP heuristic function is so expensive that I pay too much for the better heuristic.