Their first problem was sending customer requests directly into processing. They're collecting customer metrics - they should never be unable to accept a request just because another request has consumed all the resources.
A much better way to do this would be to have a very lightweight API that does nothing but put events onto a stream (Kafka or, since they seem to be on AWS, Kinesis [1]) - see the sketch below. Let the stream absorb the bursty customer data, and let your data processing run full-speed off the stream. You'll still get blips and slowdowns, but they won't affect your ability to receive more data. Log any errors or malformed data so that customers can see the problems. Do your profiling and optimisation, but never at the risk of losing data.
[1] We're using Kinesis. Very easy to provision, does what it says on the box, and can easily handle thousands of requests per second.
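To make that concrete, here's a rough sketch of what the lightweight ingest side could look like, assuming Python with Flask and boto3 (the endpoint path, stream name, and customer-id header are all made up):

    import boto3
    from flask import Flask, request

    app = Flask(__name__)
    kinesis = boto3.client("kinesis")
    STREAM_NAME = "customer-metrics"  # hypothetical stream name

    @app.route("/metrics", methods=["POST"])
    def ingest():
        # No parsing or validation here: drop the raw payload onto the
        # stream and return immediately. Heavy work happens downstream.
        kinesis.put_record(
            StreamName=STREAM_NAME,
            Data=request.get_data(),  # raw bytes, untouched
            PartitionKey=request.headers.get("X-Customer-Id", "unknown"),
        )
        return "", 202  # accepted; processing is asynchronous

And a sketch of the consumer side: read at whatever pace processing allows, and log malformed records instead of losing them. This only handles a single shard; a real deployment would use the KCL or a Lambda trigger:

    import json
    import logging
    import time
    import boto3

    logging.basicConfig(level=logging.INFO)
    kinesis = boto3.client("kinesis")
    STREAM_NAME = "customer-metrics"  # same hypothetical stream

    def process(event):
        # Stand-in for the real (possibly slow) processing pipeline.
        logging.info("processed event for %s", event.get("customer_id"))

    shard_id = kinesis.describe_stream(StreamName=STREAM_NAME)[
        "StreamDescription"]["Shards"][0]["ShardId"]
    iterator = kinesis.get_shard_iterator(
        StreamName=STREAM_NAME,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]

    while True:
        batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
        iterator = batch["NextShardIterator"]
        for record in batch["Records"]:
            try:
                process(json.loads(record["Data"]))
            except (ValueError, KeyError) as exc:
                # Malformed data is logged, not lost, so customers can see it.
                logging.warning("bad record %s: %s",
                                record["SequenceNumber"], exc)
        time.sleep(1)  # crude pacing; managed consumers handle this properly

The point is that the endpoint does almost nothing, so a slow or broken pipeline never stops you accepting data.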