Building AI that can master complex cooperative games with hidden information (facebook.com)
125 points by olibaw on Dec 6, 2019 | 28 comments


If you're wondering why this is interesting: the games AIs excel at, like chess, checkers, and Go, are all two-player, zero-sum, perfect-information (everyone knows everything), deterministic games, so you can exactly predict your opponent's behavior by simulating "what would I do if I were them, trying to make me lose?". The only really hard problem in this space is the extreme branching factor.

Everything gets vastly more complicated once you break any of those rules: non-zero-sum games create a prisoner's-dilemma cooperate/defect dynamic; with three or more players, the clean zero-sum guarantees break down even when the payoffs still sum to zero (and the complexity compounds with every player you add); hidden information forces you to manage how much you reveal to your opponent and requires you to simulate multiple "alternate futures" based on things you learn after making a decision; and randomness is equivalent to an extra player that makes irrational unpredictable moves.

Games like that are vastly closer to the messy real world than the computationally expensive but near-ideal world of games like Go, and they're much more of an open problem.
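To make the "what would I do if I were them" trick concrete, here's a minimal negamax sketch for the two-player, zero-sum, perfect-information case (the state interface here is hypothetical):

    def negamax(state, depth):
        # Zero-sum + perfect information: the opponent's best reply is
        # exactly the move I would make from their side, value negated.
        if depth == 0 or state.is_terminal():
            return state.value_for_player_to_move()
        return max(-negamax(state.apply(a), depth - 1)
                   for a in state.actions())

Every property above is load-bearing: the negation needs zero-sum, state.apply() needs determinism, and scoring the position at all needs perfect information.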


In particular, communication is in some ways a complex multiplayer game requiring multi-level modeling of other participants. In the ideal case, it can be modeled as a cooperative game. In suboptimal cases that sadly occur often in the real world, it's a game where you're fractionally cooperating and fractionally competing with every other player, and the degree to which you're cooperating or competing with any given player depends on your model of them, which you update over time based on your observations of how they "play". And it's a helpful shortcut in modeling if you can group expectations of other "players" into common conventions.

Hanabi's multi-level "what is my model of each other player, and what is my model for their model of other players including me" is far deeper than it looks on the surface. Play it for long enough, and you start to handle situations for which preconceived "conventions" don't help: "OK, of the four players at the table, three of them understand certain common conventions, one of them doesn't seem to understand at least one convention based on the misfire they just had/caused (which is also consistent with their low player rating), I can probably assume they don't understand any other conventions commonly considered more challenging than the one they just failed at, so if I give this hint, how will the more advanced players understand it, can I do so without the less advanced player misunderstanding it in a harmful way, and what will happen? And also, for future games, I should remember that this player doesn't know these conventions (yet) until I see evidence that they've improved. I might also consider helping them learn more common conventions. Or, if they don't know enough conventions and don't improve, I might not want to play with them in the future at all."


Unfortunately, Facebook’s approach sidesteps this complexity by ensuring each player uses the same random seed and by searching over policies based only on information all players can see. In my opinion, it’s not really solving the problem as intended.


We are entirely focused on the self-play setting, in which the goal is to learn the highest-performing policy for a team of agents all trained together. The Hanabi Challenge also outlines an ad-hoc setting, in which you need to adjust on the fly to the diverse policies of the other agents on the team.


> randomness is equivalent to an extra player that makes irrational unpredictable moves.

Irrational and unpredictable moves are also present in two-player perfect information games.


The difference is that in something like chess, you see it happen and adjust. An irrational move probably isn't the strongest move, and thus probably does more harm than good.

In something like Starcraft, there can be a lot of moves that seem irrational because they are only strong if they go undiscovered for long enough.


I just struggle to see the real challenge in this argument. I can see that simulating something more complex than a Go game is a real engineering challenge; simulating a Go game from start to finish is really easy, apart from assigning a value function to each move.

That being said:

> hidden information forces you to manage how much you reveal to your opponent

It is hard to see how that could be harder for a learning AI to pick up than any other game action with long-term benefits. It might turn out that AIs are so much better than humans at games like Go and Chess that they aren't even exploiting long-term links between cause and effect, but I suspect from the example we've seen with Go that this is a solved problem.

> and requires you to simulate multiple "alternate futures" based on things you learn after making a decision,

That is describing a tree search. Why is that theoretically difficult? I can see it is a real engineering challenge to simulate a complex environment.

> and randomness is equivalent to an extra player that makes irrational unpredictable moves.

Computers handle randomness much better than humans at all levels because they have an actual statistical grounding. They still aren't very good at it, but they sure thrash humans.
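For what it's worth, the quoted points reduce to textbook expectimax; a minimal sketch (the state interface is hypothetical):

    def expectimax(state, depth):
        # Single maximizing player for brevity. Chance nodes, the
        # "extra player", average over outcomes with known
        # probabilities; decision nodes maximize as usual.
        if depth == 0 or state.is_terminal():
            return state.value()
        if state.is_chance_node():
            return sum(p * expectimax(state.apply(outcome), depth - 1)
                       for outcome, p in state.chance_outcomes())
        return max(expectimax(state.apply(a), depth - 1)
                   for a in state.actions())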

I suppose, basically, I can see why a hidden-information game could be intractable: simulating the environment is so hard that it can't be done, and breakthroughs in simulation must be found to train the AI. But I don't see why that is being linked to hidden information and cooperative gameplay. We know it's possible to do fine without exploiting hidden information, because humans can play these games even though they don't know any of the hidden information.


Hi! I'm one of the authors on the paper. We'd be happy to answer any questions. Ask us anything!


Hey Noam, this is some great work; I'll need to sit down and give the paper a deeper read. Also, the visualizations on this blog post are incredible.

I saw a talk on the Libratus agent a while back, and one of the most interesting takeaways was that the behavior of the bot had already started to impact the professional players, who now spontaneously bet large amounts to force other players out of a hand. Were there any behaviors your agent demonstrated that surprised you in the same way? What insights might we draw from this cooperative AI system that may have more general applicability to other planning domains?


In terms of Hanabi, this bot arrived at conventions that are pretty different from how humans play the game. We invited an advanced Hanabi player to play with the bot and he pointed out a few things in particular that he'd like to start using. For example, humans usually have a rule that if your teammate hints multiple cards of the same color/number, you should play the newest one. The bot uses a more complicated rule: if the card you just picked up was hinted then play that card, otherwise play the oldest hinted card. That gives you way more flexibility to hint playable cards that would otherwise be tough to get played.
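As pseudocode, the rule as described (my paraphrase, not the actual agent code):

    def card_to_play(hinted_positions, newest_position):
        # Bot convention: if the card just drawn was touched by the
        # hint, play it; otherwise play the oldest touched card.
        # (The common human convention plays the newest touched card.)
        if newest_position in hinted_positions:
            return newest_position
        return min(hinted_positions)  # assuming lower position = older card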

I think one important general lesson is that search is really, really important. Deep RL algorithms are making huge advancements, but Deep RL alone can't reach superhuman performance in Go or poker without search. Here, too, we see that search was the key to conquering this game, and I think that will hold true in more complex real-world settings as well. Figuring out how to extend search to more complex real-world settings will be a challenge, but it's one worth pursuing.


> For example, humans usually have a rule that if your teammate hints multiple cards of the same color/number, you should play the newest one. The bot uses a more complicated rule: if the card you just picked up was hinted then play that card, otherwise play the oldest hinted card. That gives you way more flexibility to hint playable cards that would otherwise be tough to get played.

I've definitely seen advanced Hanabi players use a more subtle version of that rule: "If your hint looks like it's telling me to play my leftmost hinted card, how long has that card been playable? If it could have been hinted for play a long time ago, and it's just being hinted now, it must not be playable. So what else must you mean...?"

That version of the rule allows for more subtle cases. Suppose you hint that a player's second-from-the-left and fourth-from-the-left cards are both red. If there hasn't been an opportunity to hint the second-from-the-left since it became playable, go ahead and play the second-from-the-left. If there have been opportunities to hint second-from-the-left, play fourth-from-the-left.

That rule requires human players to model whether the other players' actions in the interim were "urgent" things that needed taking care of before giving the hint, or whether those players would have hinted the card sooner had it been playable.
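As a rough decision rule (names illustrative, and simplified to ignore everything else happening in a real game):

    def choose_from_multi_card_hint(hinted_positions, missed_chances):
        # hinted_positions: cards touched by the hint, leftmost first.
        # missed_chances[pos]: earlier turns on which that card could
        # have been hinted for play but wasn't. A card that sat
        # playable-but-unhinted is probably not the intended target.
        for pos in hinted_positions:
            if missed_chances.get(pos, 0) == 0:
                return pos
        return hinted_positions[-1]  # fall back to the newest touched card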


Isn't this different from AI that can beat humans at games like Bridge, where the bidding game matters quite a lot? IIRC, in Hanabi the imperfect information doesn't actually matter that much for strategy, whereas in games like League of Legends or Bridge it matters quite a lot.


Bridge has a similar challenge, though from what I understand Bridge AIs are not superhuman yet. I suspect our techniques could be applied to Bridge, though they may need to be adapted a bit.

The imperfect information in Hanabi absolutely matters a ton. It's not an interesting game without it.


Can you come back in a day or so and answer some questions? I, like others, need some time to digest it.


Definitely!


Damn, I haven’t gotten around to fully reading Pluribus yet, and now there’s more? Congrats on the results! What’s next?


Thanks! We're looking in a few different directions, but one thing I'm excited about is mixed cooperative/competitive settings. In poker, there is no room for cooperation. In Hanabi, you are 100% cooperating with your teammates. But most real-world situations, like a negotiation, are somewhere in between. The AI techniques for these settings are not too strong yet.


What's FB going to do with this?


Open source it, learn from it, and build upon it to continue to push forward the frontier of AI.


That would be nice. What is Facebook AI's take on the ethical use of its research?

Facebook probably pays through the nose for AI research and probably wants an ROI. Facebook makes money by building better user models and spamming targeted ads. Some of them are getting scarily good.


They (basically) applied the ideas from a bot that plays poker to another game. It's interesting work, though perhaps not groundbreaking.

This idea of self-play + counterfactual regret minimization does seem to be the superior way to solve game-theoretic problems. Identifying valuable game-theoretic problems remains a challenge...


The search algorithm shares a lot in common with our Pluribus poker AI (https://ai.facebook.com/blog/pluribus-first-ai-to-beat-pros-...), but we added "retrospective belief updates" which makes it way more scalable. We also didn't use counterfactual regret minimization (CFR) because in cooperative games you want to be as predictable as possible, whereas CFR helps make you unpredictable in a balanced way (useful in poker).
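Roughly, a retrospective belief update works like this (a sketch of the idea, not the released code): after seeing a teammate act, prune every hidden hand under which the shared blueprint policy would not have produced that action.

    def retrospective_update(belief, observed_action, blueprint):
        # belief: dict mapping each possible hidden hand -> probability.
        # blueprint(hand): the action the shared policy takes with that hand.
        # Assumes the teammate really followed the blueprint, so the
        # filtered belief is non-empty.
        kept = {hand: p for hand, p in belief.items()
                if blueprint(hand) == observed_action}
        total = sum(kept.values())
        return {hand: p / total for hand, p in kept.items()}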

The most surprising takeaway is just how effective search was. People were viewing Hanabi as a reinforcement learning challenge, but we showed that adding even a simple search algorithm can lead to larger gains than any existing deep RL algorithm could achieve. Of course, search and RL are completely compatible, so you can combine them to get the best of both worlds, but I think a lot of researchers underestimated the value of search.
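Conceptually, the search itself can be as simple as this sketch (illustrative, not the actual implementation):

    import random

    def search_move(belief, legal_actions, rollout, n_samples=1000):
        # Sample hidden hands from the current belief, score each legal
        # action by its average rollout value with everyone assumed to
        # keep following the blueprint afterwards, and play the best one.
        # rollout(hand, action) -> expected score; a hypothetical helper.
        hands, probs = zip(*belief.items())
        sampled = random.choices(hands, weights=probs, k=n_samples)
        def score(action):
            return sum(rollout(hand, action) for hand in sampled) / n_samples
        return max(legal_actions, key=score)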


I just spent three weeks going through your research. Thank you for that work, especially the supplementary materials. I wish I'd known how much the ideas in the Pluribus paper depended on reading the Libratus paper.

I see what you're saying about the real-time search (which took me quite some time to understand). I came up with a way to do that from disk due to memory limitations. It limits the number of search iterations but doesn't seem to have a huge negative impact on quality so far.

Anyway, thanks again!


Hanabi definitely has traits involving theory of mind that I've not seen in other games.

For example, I've played Hanabi with 3+ players where the person before me deliberately gave a misleading hint to the person after me: say, "this is your only blue card", indicating a blue 5 even though only blue 1-3 have been played. They were counting on me to anticipate that the misled person was now very likely to waste that valuable card, and to realize that I could only reasonably avert that misplay by playing a blue 4, which is how I came to realize that I must be holding a blue 4.

Perhaps that depth of theory of mind can be useful in poker, but I must confess that I'm not playing poker at a level where it'd be helpful.


> For example, I've played Hanabi with 3+ players where the person before me deliberately gave a misleading hint to the person after me: say, "this is your only blue card", indicating a blue 5 even though only blue 1-3 have been played. They were counting on me to anticipate that the misled person was now very likely to waste that valuable card, and to realize that I could only reasonably avert that misplay by playing a blue 4, which is how I came to realize that I must be holding a blue 4.

(In Hanabi convention, that kind of hint is typically called a "finesse".)

Hanabi has a huge amount of depth in modeling other players. (To the point that the most common failure mode in Hanabi is assuming too much about the reasoning another player will do, or overthinking things in "iocaine powder" style: "but you know that, but I know you know that, but...".) In general, you can often assume your partners can make use of most of the information available to them, especially when playing electronically with an interface that makes all that information readily visible, such as "what hints has everyone received, and which cards did those hints identify". (As a notable exception, it's relatively uncommon to use "negative information" if it isn't being tracked for you, such as "this card isn't red or blue, because you've gotten red and blue hints that didn't point to it".)

One thing this article doesn't talk about in detail is how well the bot manages to make plays that humans can understand when playing alongside it. (The article mentions things that improve play with humans, but doesn't mention how well human players can understand and cooperate with the bot using common conventions. The comments here do talk about that, though!) The article focuses on the bot playing itself.

I would expect that when playing itself the bot can learn to give hints that will result in multiple successful plays; that'll be especially true when expanding to 3+ players. Effectively, the bot can learn to communicate with itself using hints as a channel to provide information. The question then becomes, are those hints trained for maximum information efficiency at the expense of human understanding, or are those hints something that relies on logical reasoning humans can cooperate with?

A group of human players who know they'll only be playing with each other could absolutely develop a set of conventions more efficient and effective than those commonly used for playing with unknown players. Those conventions will then be completely baffling to anyone outside that group, and will result in serious losses when playing with anyone else.

In the end this becomes an information-theory problem: how many bits of information does each player need in order to know what to play, and how efficiently can you communicate that information to all players in time for them to play, using the limited channels of hints, discards, plays, and even intentional misfires? And you might be able to squeeze even more information through that channel if you can Huffman-code it based on the likelihood of possible game states in other players' minds. If you know the other players are copies of you or very similar to you, and are doing similar modeling and assuming others are doing similar modeling, then you can encode the information very densely.
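To make "encode the information very densely" concrete, the known "hat guessing" family of Hanabi strategies does exactly this with a modular sum (sketching from memory; details vary by variant):

    def encode_hint(teammate_codes, num_hint_values):
        # The hinter sees every other hand, assigns each teammate a
        # recommendation code (e.g. "play slot 2"), and publishes only
        # the sum of those codes mod the number of distinct hints.
        return sum(teammate_codes) % num_hint_values

    def decode_my_code(published_value, codes_i_can_see, num_hint_values):
        # Each teammate computes the codes of everyone else from the
        # hands they can see, then subtracts; the remainder is their own.
        return (published_value - sum(codes_i_can_see)) % num_hint_values

One public action hands every teammate a private recommendation at once, which only works because everyone runs exactly the same procedure: maximally efficient, and utterly opaque to an outsider.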

But maximizing the information-theoretic bandwidth of that communication channel doesn't necessarily result in hints that support logical reasoning. That requires modeling other players who aren't necessarily copies of you, and figuring out how they'll understand your hint or play.

High-level Hanabi play absolutely results in complex multi-level models of other players. You don't just need to model the other players, you also need to model how the other players will model others, including you.

Some examples of the kinds of logical reasoning common in a 3+ player Hanabi game, including such multi-level models (assume you're player 1 and the players after you are 2, 3, 4, and so on):

- Finesse: Player 1 hints player 3 or later, with a hint that they could interpret as "play", which would otherwise misfire, so an earlier player knows to play a card in their hand that will work. Also note that with 4+ players, you could give a hint to player 4 or later, and the players in-between have to figure out which of them has the card to make the finesse work. Player 2 will see if any other players have a card that will work, and if so, they'll skip playing (doing something else like hinting instead) and a subsequent player will know they need to play.

- Multiple finesse: Player 1 hints player 4 or later, and multiple players in-between need to play for that hint to work.

- Bluff: Player 1 hints player 3 or later, with a hint that looks like a finesse, knowing that player 2 will think it's a finesse. Player 2 plays their card, only to find that it isn't the card that would make the hinted player not bomb, but it is a different playable card. The hinted player then needs to realize that the bluff happened, and not play the card they were hinted, but remember what card it must be to have looked like a finesse. (For instance, red 1-3 and blue 1 have been played, player 1 hints red to player 3's 5R, player 2 plays their leftmost card thinking it's the playable 4R, but it's actually the playable 2B. Player 3 now knows they have the 5R, and later in the game, once the 4R is played, player 3 plays the 5R without needing further hints.)

- Reverse finesse: Player 1 hints player 2, which might otherwise look like telling player 2 to play. But player 2 sees that player 3 or later has a playable card that it seems like you've hinted player 2 as having. For instance, red 1-3 have been played, you hint player 2's 5R as red, and player 3 has the 4R either as the newest card or as the newest card that's hinted either red or 4. Player 2 realizes that they shouldn't play their R (which they now know to be the 5R) because you are reversing player 3's 4R. Player 3 then realizes that because player 2 knew not to play the 5R, they must have the 4R so they should play it. And on the next time around, player 2 needs to remember that they have the 5R and play it without any further hint from anyone.


We haven't yet analyzed the gameplay to look for examples of these well-known human Hanabi conventions. All the code and agents are open-sourced though, so feel free to take a look!


They should tackle Starcraft II next, as DeepMind has with AlphaStar, or at least a similar real-time strategy game with fog of war and a partially observable state.


There are unique challenges around learning effective communication protocols that appear in cooperative settings, which was the focus of this work. Getting robust superhuman performance in SC2 remains an interesting challenge, though.



