Over 500PB of data, wow. Would love to know how and why "statistical models that produce price forecasts for over 50,000 financial instruments worldwide" require that much storage.
Message Volume: 1,684,103,265
Messages per Second: 1,134,640
Order Volume: 871,875,595
Orders per Second: 581,696
Share Volume: 12,814,454,760
Executions per Second: 193,350
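Just to put those numbers in context, here's a rough back-of-envelope in Python; the per-message byte size, venue count, and history length are my guesses, not anything from the exchange:

```python
# Back-of-envelope on the stats above: how fast raw feed data piles up.
# Bytes per message, venue count, and years are assumptions
# (ITCH-style binary messages run roughly 20-60 bytes each).
messages_per_day = 1_684_103_265        # message volume from the stats above
bytes_per_message = 40                  # assumed average message size
gb_per_day = messages_per_day * bytes_per_message / 1e9

trading_days = 252                      # per year
years = 20
venues = 50                             # assumed number of feeds captured
total_pb = gb_per_day * trading_days * years * venues / 1e6

print(f"~{gb_per_day:.0f} GB/day for one feed, ~{total_pb:.0f} PB over {years} years")
# -> ~67 GB/day, ~17 PB of raw messages alone, before quotes at every
#    book level, derived datasets, and redundant copies
```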
Also, if you look at equity derivative products, which have parameters like type (call/put), strike, and maturity, there can be hundreds of financial products for one underlying stock; the sketch below shows the combinatorics.
I worked in this sector, and the volume of data is a real challenge; no wonder you often get custom software to handle it :)
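To make that concrete, here's a toy option chain for a single underlying; the strike grid and maturity list are made up, but typical in shape:

```python
from itertools import product

# Hypothetical listed-option chain for one stock; strikes and
# maturities are invented but representative.
strikes = [80 + 5 * i for i in range(17)]   # $80..$160 in $5 steps
maturities = ["2025-01", "2025-03", "2025-06",
              "2025-09", "2025-12", "2026-06"]
option_types = ["call", "put"]

chain = list(product(option_types, strikes, maturities))
print(len(chain))  # 2 * 17 * 6 = 204 distinct contracts for one underlying
```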
How do you propose lossless compression for all order book data? Of course, if you are willing to lose granularity/information, it can be compressed a lot.
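One standard lossless trick is to exploit the structure: store timestamps and prices as integers, delta-encode them column-wise (consecutive updates differ very little), then run a general-purpose compressor over the residuals. A minimal sketch with made-up updates:

```python
import struct
import zlib

# Synthetic order-book updates: (timestamp_ns, price_ticks, size).
# Real feeds use fixed-point ticks, so everything here is an integer.
updates = [(1_700_000_000_000 + i, 10_000 + (i % 7) - 3, 100 + i % 5)
           for i in range(10_000)]

def delta_encode(values):
    """Replace each value with its difference from the previous one."""
    prev, out = 0, []
    for v in values:
        out.append(v - prev)
        prev = v
    return out

cols = [list(c) for c in zip(*updates)]     # column-wise layout
deltas = [delta_encode(c) for c in cols]    # small residuals per column

raw = b"".join(struct.pack("<q", v) for c in cols for v in c)
enc = b"".join(struct.pack("<q", v) for c in deltas for v in c)
print(len(zlib.compress(raw)), len(zlib.compress(enc)))  # deltas win by a wide margin
```

No information is lost: decoding is just a cumulative sum, and the residual stream is far more redundant than the raw one, so the general-purpose compressor does much better on it.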
I would imagine it's, to a lesser extent, government policy changes and news articles, and to a larger extent, online discussions on topics relevant to these instruments. Models then attempt to extract signals with predictive value from all the noise. It probably includes a non-trivial amount of history, say 20 years or more, to correlate words to past market performance.
But it's really just a guess, I haven't worked in this domain.
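For whatever that guess is worth, the mechanics might look something like this toy sketch; all the data here is synthetic, and a real pipeline would need decades of text and price history, which is plausibly where a lot of the storage goes:

```python
import numpy as np

# Toy version of "correlate words to past market performance":
# daily mention counts of some term vs. the instrument's next-day return.
rng = np.random.default_rng(0)
days = 5_000                                   # roughly 20 years of trading days
mentions = rng.poisson(5, days).astype(float)  # synthetic daily count of a term
returns = rng.normal(0.0, 0.01, days)          # synthetic daily returns
returns[1:] += 0.001 * (mentions[:-1] - 5)     # plant a weak lagged signal

corr = np.corrcoef(mentions[:-1], returns[1:])[0, 1]
print(f"lagged correlation: {corr:.3f}")       # small but nonzero: a candidate signal
```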