They (basically) applied the ideas from a bot that plays poker to another game. It's interesting work, though perhaps not groundbreaking.
This idea of self-play + counterfactual regret minimization does seem to be the superior way to solve game theoretic problems. Identifying valuable game theoretic problems remains a challenge...
The search algorithm has a lot in common with our Pluribus poker AI (https://ai.facebook.com/blog/pluribus-first-ai-to-beat-pros-...), but we added "retrospective belief updates" which makes it way more scalable. We also didn't use counterfactual regret minimization (CFR) because in cooperative games you want to be as predictable as possible, whereas CFR helps make you unpredictable in a balanced way (useful in poker).
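For anyone curious what CFR's core update looks like, here's a toy regret-matching sketch (the per-decision-point rule at the heart of CFR), shown on rock-paper-scissors against a fixed biased opponent. All names and the setup are mine for illustration, not anything from the Pluribus codebase:

```python
import random

ACTIONS = ["rock", "paper", "scissors"]

def payoff(mine, theirs):
    """+1 win, 0 tie, -1 loss."""
    if mine == theirs:
        return 0
    wins = {("rock", "scissors"), ("paper", "rock"), ("scissors", "paper")}
    return 1 if (mine, theirs) in wins else -1

def strategy_from_regrets(regrets):
    """Regret matching: play actions in proportion to positive cumulative regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total == 0:
        return [1.0 / len(regrets)] * len(regrets)
    return [p / total for p in positive]

def train(opponent_probs, iters=20000, seed=0):
    rng = random.Random(seed)
    regrets = [0.0] * len(ACTIONS)
    strategy_sum = [0.0] * len(ACTIONS)
    for _ in range(iters):
        strat = strategy_from_regrets(regrets)
        for i, p in enumerate(strat):
            strategy_sum[i] += p
        theirs = rng.choices(ACTIONS, weights=opponent_probs)[0]
        mine = rng.choices(ACTIONS, weights=strat)[0]
        got = payoff(mine, theirs)
        # Regret of each action = what it would have earned minus what we got.
        for i, a in enumerate(ACTIONS):
            regrets[i] += payoff(a, theirs) - got
    total = sum(strategy_sum)
    return [s / total for s in strategy_sum]

# Against an opponent who mostly plays rock, the average strategy
# should drift heavily toward paper (the best response).
avg = train([0.6, 0.2, 0.2])
```

Note this converges to exploiting a fixed opponent; full CFR runs this update at every decision point in self-play, and it's the time-averaged strategy (not the final one) that approaches equilibrium, which is where the "balanced unpredictability" comes from.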
The most surprising takeaway is just how effective search was. People were viewing Hanabi as a reinforcement learning challenge, but we showed that adding even a simple search algorithm can lead to larger gains than any existing deep RL algorithm could achieve. Of course, search and RL are completely compatible, so you can combine them to get the best of both worlds, but I think a lot of researchers underestimated the value of search.
I just spent three weeks going through your research. Thank you for that work, especially the supplementary materials. I wish I'd known how much the ideas in the Pluribus paper depended on reading the Libratus paper.
I see what you're saying about the real time search (which took me quite some time to understand). I came up with a way to do that from disk due to memory limitations. It limits the number of search iterations but doesn't seem to have a huge negative impact on quality so far.
Hanabi definitely has traits involving theory of mind that I've not seen present in other games.
For example, I've played Hanabi with 3+ players where the person before me deliberately gave a misleading hint to the person after me. For example, "this is your only blue card" indicating a blue 5 even though only blue 1-3 have been played. They were counting on me to anticipate that the misled person was now very likely to waste that valuable card & realize that I could only reasonably avert that misplay by playing a blue 4, which is how I came to realize that I must be holding a blue 4.
Perhaps that depth of theory of mind can be useful in poker, but I must confess that I'm not playing poker at a level where it'd be helpful.
> For example, I've played Hanabi with 3+ players where the person before me deliberately gave a misleading hint to the person after me. For example, "this is your only blue card" indicating a blue 5 even though only blue 1-3 have been played. They were counting on me to anticipate that the misled person was now very likely to waste that valuable card & realize that I could only reasonably avert that misplay by playing a blue 4, which is how I came to realize that I must be holding a blue 4.
(In Hanabi convention, that kind of hint is typically called a "finesse".)
Hanabi has a huge amount of depth in modeling other players. (To the point that the most common failure mode in Hanabi is assuming too much about the reasoning another player will do, or overthinking things in "iocaine powder" style, "but you know that, but I know you know that, but ...".) In general, you can often assume your partners can make use of most of the information available to them, especially when playing electronically with an interface that makes all that information readily visible, such as "what hints has everyone received and which cards did those hints identify". (As a notable exception, it's relatively uncommon to use "negative information" if it isn't being tracked for you, such as "this card isn't red or blue because you've gotten red and blue hints that didn't point to it".)
One thing this article doesn't talk about in detail is how well the bot manages to make plays that humans can understand when playing alongside it. (The article mentions things that improve play with humans, but doesn't mention how well human players can understand and cooperate with the bot using common conventions. The comments here do talk about that, though!) The article focuses on the bot playing itself.
I would expect that when playing itself the bot can learn to give hints that will result in multiple successful plays; that'll be especially true when expanding to 3+ players. Effectively, the bot can learn to communicate with itself using hints as a channel to provide information. The question then becomes, are those hints trained for maximum information efficiency at the expense of human understanding, or are those hints something that relies on logical reasoning humans can cooperate with?
A group of human players who know they'll only be playing with each other could absolutely develop a set of conventions more efficient and effective than those commonly used for playing with unknown players. Those conventions will then be completely baffling to anyone outside that group, and will result in serious losses when playing with anyone else.
In the end this can become an information-theory issue: how many bits of information does each player need to know what to play, and how efficiently can you communicate that information to all players in time for them to play, using the limited information channels of hints and discards and plays and even intentional misfires. And you might be able to squeeze even more information through that channel if you can Huffman-code it based on the likelihood of possible game states in other players' minds. If you know the other players are copies of you or very similar to you, and are doing similar modeling and assuming others are doing similar modeling, then you can encode the information very densely.
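To make that Huffman intuition concrete, here's a toy sketch over hypothetical "what does my partner need to know" states. The states and probabilities are entirely made up; the point is just that likelier states get shorter codewords, so the expected message length drops below a fixed-length encoding:

```python
import heapq

def huffman_code(probs):
    """probs: dict state -> probability. Returns dict state -> bitstring."""
    # Heap entries carry a unique counter so ties never compare the lists.
    heap = [(p, i, [s]) for i, (s, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    codes = {s: "" for s in probs}
    counter = len(heap)
    while len(heap) > 1:
        p1, _, states1 = heapq.heappop(heap)
        p2, _, states2 = heapq.heappop(heap)
        # Merging two subtrees prepends one bit to every leaf beneath them.
        for s in states1:
            codes[s] = "0" + codes[s]
        for s in states2:
            codes[s] = "1" + codes[s]
        heapq.heappush(heap, (p1 + p2, counter, states1 + states2))
        counter += 1
    return codes

# Made-up distribution over what a hint needs to convey:
states = {
    "partner holds playable 1": 0.50,
    "partner holds playable 2": 0.25,
    "partner holds critical 5": 0.15,
    "anything else": 0.10,
}
codes = huffman_code(states)
expected_bits = sum(p * len(codes[s]) for s, p in states.items())
# A fixed-length code needs 2 bits for 4 states; here the expected
# length comes out to 1.75 bits because the common case costs 1 bit.
```

The catch, as the next paragraph notes, is that a code this dense only works if every player agrees on the probabilities, i.e. if they're running near-identical models of each other.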
But maximizing the information-theoretic bandwidth of that communication channel doesn't necessarily result in hints that support logical reasoning. That requires modeling other players who aren't necessarily copies of you, and figuring out how they'll understand your hint or play.
High-level Hanabi play absolutely results in complex multi-level models of other players. You don't just need to model the other players, you also need to model how the other players will model others, including you.
Some examples of the kinds of logical reasoning common in a 3+ player Hanabi game, including such multi-level models (assume you're player 1 and the players after you are 2, 3, 4, and so on):
- Finesse: Player 1 gives player 3 or later a hint they could interpret as "play", one that would otherwise misfire; an earlier player realizes this and knows to play a card from their own hand that will make it work. Also note that with 4+ players, you could give a hint to player 4 or later, and the players in between have to figure out which of them has the card to make the finesse work. Player 2 will see if any other players have a card that will work, and if so, they'll skip playing (doing something else like hinting instead) and a subsequent player will know they need to play.
- Multiple finesse: Player 1 hints player 4 or later, and multiple players in-between need to play for that hint to work.
- Bluff: Player 1 hints player 3 or later, with a hint that looks like a finesse, knowing that player 2 will think it's a finesse. Player 2 plays their card, only to find that it isn't the card that would make the hinted player not bomb, but it is a different playable card. The hinted player then needs to realize that the bluff happened, and not play the card they were hinted, but remember what card it must be to have looked like a finesse. (For instance, red 1-3 and blue 1 have been played, player 1 hints red to player 3's 5R, player 2 plays their leftmost card thinking it's the playable 4R, but it's actually the playable 2B. Player 3 now knows they have the 5R, and later in the game, once the 4R is played, player 3 plays the 5R without needing further hints.)
- Reverse finesse: Player 1 hints player 2, which might otherwise look like telling player 2 to play. But player 2 sees that player 3 or later has a playable card matching the hint player 1 just gave. For instance, red 1-3 have been played, player 1 hints player 2's 5R as red, and player 3 has the 4R either as their newest card or as the newest card that's hinted either red or 4. Player 2 realizes that they shouldn't play their red card (which they now know to be the 5R) because player 1 is reversing player 3's 4R. Player 3 then realizes that because player 2 knew not to play the 5R, they must have the 4R, so they should play it. And on the next time around, player 2 needs to remember that they have the 5R and play it without any further hint from anyone.
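The first-level inference behind all of these can be expressed surprisingly compactly. Here's a toy sketch of the basic finesse deduction (matching the earlier blue-5 example: blue 1-3 played, a hint lands on a blue 5, so the in-between player infers they hold the blue 4). All names and the simplified state are mine; this is nothing like a full Hanabi engine:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Card:
    color: str
    rank: int

def next_playable(fireworks, color):
    """fireworks: dict color -> highest rank played so far (0 if none)."""
    return Card(color, fireworks[color] + 1)

def finesse_target(fireworks, hinted_card):
    """If a hint points at a card exactly one rank above what's currently
    playable in that color, the hint only makes sense if someone first
    plays the connecting card -- so return that card; otherwise None."""
    needed = next_playable(fireworks, hinted_card.color)
    if hinted_card.rank == needed.rank + 1:
        return needed
    return None

# Blue 1-3 played, hint lands on a blue 5: the in-between player
# deduces they must be holding (and should play) the blue 4.
inferred = finesse_target({"blue": 3}, Card("blue", 5))
```

The multi-player variants above layer on top of this: each in-between player first checks whether some *other* player visibly holds the connecting card before concluding it must be in their own hand.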
We haven't yet analyzed the gameplay to look for examples of these well-known human Hanabi conventions. All the code and agents are open-sourced though, so feel free to take a look!