With the 'cattle not pets' mindset that pervades modern development, is the lifespan of ephemeral cache VMs really monitored that closely? They get spun up and down on demand in most architectures. I can see this being an edge-case failure that shows up when the system is trying to scale up: the existing VMs are getting absolutely hammered, the hypervisor is trying to start new ones, and memory pressure and IOPS on the existing ones are maxed out...
It just seems like the most obvious root cause to me. A single bit-flip in a hashed value will give you the wrong result data without any other error, because a hash is already a heavily compressed representation: a flipped hash looks like any other valid hash, so nothing flags it as corrupt. Meanwhile, the hash table is almost certainly held entirely in memory and is very heavily accessed from multiple directions, with both reads and writes.
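
To make that concrete, here's a toy sketch in Python of what I mean. Everything in it is made up for illustration (`digest16`, `flip_bit`, the dict-based cache), and I have no idea how the real cache is laid out. The hash is truncated to 16 bits purely so the demo can find a second key that lands on the corrupted value quickly; with a full-width hash you would need an existing key to already sit there, but the failure mode is the same.

```python
import hashlib

def digest16(key: str) -> int:
    # Hypothetical hash: SHA-256 truncated to 16 bits so the demo runs fast.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:2], "big")

def flip_bit(value: int, bit: int) -> int:
    # Simulate a single-bit memory error in a stored hash.
    return value ^ (1 << bit)

# Toy cache: stores only the hash of the key, not the key itself.
cache = {}

def put(key: str, value: str) -> None:
    cache[digest16(key)] = value

def get(key: str):
    return cache.get(digest16(key))

put("alice", "alice's session")

# One bit flips in the stored hash; the entry now lives under a different hash.
stored = digest16("alice")
corrupted = flip_bit(stored, 3)
cache[corrupted] = cache.pop(stored)

# Find some other key whose hash happens to equal the corrupted value.
other = next(
    f"user:{i}" for i in range(10_000_000)
    if digest16(f"user:{i}") == corrupted
)

print(get("alice"))  # a plain cache miss, which is harmless
print(other, "->", get(other))  # someone else gets alice's data, no error raised
```

The miss for "alice" is the benign half. The nasty half is the second lookup: it succeeds, returns well-formed data, and nothing in the path has any reason to raise an error.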