He didn't really explain it, but I think what was going on is that the rate limiting is done per account, and the race condition was a way to circumvent it. He had to make all the requests very quickly because the first thing each request does is check whether new requests for this account should be ignored. All the requests arrive around the same time, they all make this check and decide they are valid, and only then do they each record that an attempt was made for that account (locking it).
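In other words it's a check-then-act race. A toy sketch of the vulnerable flow (the in-memory counter and the sleep are just stand-ins for the real database reads and writes):

    import threading
    import time

    MAX_ATTEMPTS = 5
    attempts = 0                # stand-in for the per-account attempt counter
    evaluated_guesses = []

    def handle_reset_attempt(guess):
        global attempts
        # 1. check whether new attempts for this account should be ignored
        if attempts >= MAX_ATTEMPTS:
            return
        time.sleep(0.1)                   # stand-in for work done before recording
        evaluated_guesses.append(guess)   # 2. the guess actually gets checked
        attempts += 1                     # 3. only now is the attempt recorded

    threads = [threading.Thread(target=handle_reset_attempt, args=(g,)) for g in range(50)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # All 50 guesses slip past a limit of 5, because every request passed the
    # check before any request recorded its attempt.
    print(len(evaluated_guesses), "guesses evaluated; limit was", MAX_ATTEMPTS)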
The rate limiting for IPs is probably global (not related to the reset endpoint).
I think you are dead on, yeah, it's the quick burst of a large number of requests that avoids the per-account rate limiting. Curious how they resolved this. Run all authentication requests for a given user serially, through some single consolidation point? Take an exclusive lock on the relevant DB record before checking the code and recording the failure?
Yeah, my first thought was to make every attempt acquire some per-user lock with a timeout. It's pretty much the same thing. Either one would have a negligible effect on legitimate requests and would solve the problem.
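A rough sketch of the row-lock version against PostgreSQL (table, column names, and the timeout are made up for illustration); the FOR UPDATE serializes concurrent attempts for one account, so they can't all pass the check before any of them records a failure:

    import psycopg2

    MAX_ATTEMPTS = 5

    def try_reset_code(conn, account_id, submitted_code):
        with conn:  # psycopg2: commit on success, roll back on exception
            with conn.cursor() as cur:
                # Bound how long a request waits for the lock.
                cur.execute("SET LOCAL lock_timeout = '2s'")
                # Exclusive-lock this account's row; concurrent attempts queue
                # here and run one at a time instead of racing past the check.
                cur.execute(
                    "SELECT code, attempts FROM reset_codes"
                    " WHERE account_id = %s FOR UPDATE",
                    (account_id,),
                )
                row = cur.fetchone()
                if row is None:
                    return False
                code, attempts = row
                if attempts >= MAX_ATTEMPTS:
                    return False
                cur.execute(
                    "UPDATE reset_codes SET attempts = attempts + 1"
                    " WHERE account_id = %s",
                    (account_id,),
                )
                return code == submitted_code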
Could start by incrementing the value, then checking whether it's still below the threshold, similar to an atomic fetch_add operation. PostgreSQL has a RETURNING clause, SQL Server has an OUTPUT clause, etc.
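A minimal sketch of that increment-first pattern with PostgreSQL's RETURNING (the reset_attempts table is hypothetical, with account_id as its primary key); the bump and the read are a single atomic statement, so concurrent requests each see a distinct count:

    import psycopg2

    MAX_ATTEMPTS = 5

    def attempt_allowed(conn, account_id):
        with conn:
            with conn.cursor() as cur:
                # Insert the first attempt, or atomically bump the existing
                # counter, and read back the new value in the same statement.
                cur.execute(
                    """
                    INSERT INTO reset_attempts (account_id, attempts)
                    VALUES (%s, 1)
                    ON CONFLICT (account_id)
                    DO UPDATE SET attempts = reset_attempts.attempts + 1
                    RETURNING attempts
                    """,
                    (account_id,),
                )
                (attempts,) = cur.fetchone()
        return attempts <= MAX_ATTEMPTS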
Yeah, it is hard. The enforcement would need to happen on a single backend. Not every user needs to have their auth handled by the same specific backend, but each individual user's auth should always go to the same backend (or the same concurrency domain, if the architecture uses distributed locking).
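A trivial sketch of that kind of sticky per-user routing (the backend names are invented; a real setup would want consistent hashing so the mapping survives pool changes):

    import hashlib

    AUTH_BACKENDS = ["auth-1.internal", "auth-2.internal", "auth-3.internal"]

    def backend_for(user_id: str) -> str:
        # Stable hash so a given user's auth always lands on the same backend,
        # which is then the single place its per-account limit is enforced.
        digest = hashlib.sha256(user_id.encode()).digest()
        return AUTH_BACKENDS[int.from_bytes(digest[:8], "big") % len(AUTH_BACKENDS)]

    print(backend_for("alice@example.com"))  # same backend every time for this user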