Nasdaq ITCH (ryanday.net)
28 points by veyron on Aug 21, 2011 | 18 comments



Some things to think about:

  - 50 MB is >> the total cache on most systems you'll use.
  - This is far from the only function you need to optimize.
  --> The best-performing algorithm in a micro-benchmark can be sub-optimal in real use.

  - You've got a fixed-for-the-day list of <10k unique upper-case strings, each <9 letters.
  - The bulk of them you'll process are <= 4 letters (CMCSA/CMCSK be damned).
  - The median is 3 and the mean is <3.6.
  - The symbols aren't anywhere close to uniformly active.
  --> Benchmark with a typical data stream, not a uniformly distributed one (see the sketch after this list).

  - You're likely not even considering trading most of them.

  - How much better is your algorithm than the obvious binary-search?
  - Or worse -- plain ol' front-to-back linear search?
  - Are you sure you understand 100% of how the computer works?
    --> Yes? Mail a check to Goldman and find a better line of work.
    --> No? Test it with a quick benchmark.

  - Is binary search cache friendly?
  - Why not? Plot out the memory accesses if need be.

  - What's wrong with taking a linear-interpolated guess?
  - Is that cache friendly? .. or could it be?
  - When your guess misses, what's that cost?
  - Can you think of a cheap modification to lower that cost?
  - Is the distribution really linear?

  - What's wrong with a "perfect hash function"?
  - Does it need to be perfect?
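
Along those lines, here's a minimal benchmark sketch in Python. The symbol universe and the Zipf-style skew below are invented stand-ins; a real test would replay a day's actual feed:

    # Skewed stream vs. two lookup strategies. Symbols and weights are
    # made up for illustration, not real ITCH data.
    import bisect
    import random
    import time

    random.seed(42)

    # Fake universe: <10k unique upper-case strings, mostly short.
    symbols = sorted({"".join(random.choices("ABCDEFGHIJKLMNOPQRSTUVWXYZ",
                                             k=random.randint(1, 5)))
                      for _ in range(8000)})
    index_of = {s: i for i, s in enumerate(symbols)}

    # Skewed stream: a handful of symbols dominate, as in real market data.
    weights = [1.0 / (rank + 1) for rank in range(len(symbols))]
    stream = random.choices(symbols, weights=weights, k=500_000)

    def bench(fn):
        start = time.perf_counter()
        for s in stream:
            fn(s)
        return time.perf_counter() - start

    print("binary search:", bench(lambda s: bisect.bisect_left(symbols, s)))
    print("dict lookup:  ", bench(index_of.__getitem__))
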
Incidentally, hardware wins at this function. You can just generate logic to do all the if symbol == "XYZ": return 3's in parallel.
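
For intuition, here's a rough software model of that parallel-compare idea, using NumPy to evaluate every comparator at once; the tiny symbol table is a made-up stand-in:

    # Every "if symbol == ...: return i" evaluated simultaneously.
    import numpy as np

    SYMBOLS = ["A", "IBM", "MSFT", "CMCSA", "GOOG"]

    # Fixed-width table: one space-padded 8-byte row per symbol.
    table = np.frombuffer(b"".join(s.encode().ljust(8) for s in SYMBOLS),
                          dtype=np.uint8).reshape(len(SYMBOLS), 8)

    def lookup(symbol):
        probe = np.frombuffer(symbol.encode().ljust(8), dtype=np.uint8)
        hits = (table == probe).all(axis=1)  # all "comparators" fire at once
        return int(hits.argmax()) if hits.any() else -1

    print(lookup("MSFT"))  # -> 2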


This is a crazy smart comment. Thanks.

... is anyone really using hardware to do this, by the way?

Hardware regexing at this scale is practically a COTS part, or was when I stopped doing network security products in '05.


Unfortunately it's not that simple anymore. You actually have to do most of the order-book-building legwork just to figure out whether a quote is relevant: every order modification message, such as a cancel or an execution, omits the ticker. That makes sense: the add-order message already tells you the ticker, so there's no need to repeat that info.
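
To make that concrete, a minimal sketch of the bookkeeping involved; the message shapes here are simplified stand-ins, not the real ITCH layout:

    # Cancels/executions carry only an order reference number, so even a
    # symbol filter has to remember which order belongs to which ticker.
    order_symbol = {}                  # order reference number -> ticker
    interesting = {"IBM", "MSFT"}      # placeholder watchlist

    def on_add_order(order_ref, symbol):
        order_symbol[order_ref] = symbol
        return symbol in interesting   # keep this message?

    def on_cancel_or_execute(order_ref):
        # The ticker is absent from the message; recover it from the map.
        return order_symbol.get(order_ref) in interesting

    def on_delete(order_ref):
        order_symbol.pop(order_ref, None)  # free the entry once the order dies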

This is where it gets tricky: FPGA-based systems don't really give an edge here, and Tilera-based solutions are still nascent.


This craziness is what happens when you try to implement a specialized algorithm on a general-purpose state machine. What is a "cache" anyway but an optimization, needed because we are not using real computers but hugely complex state machines called CPUs? FPGAs are as close to a real computer as we can get today.


The parallels to high end network security are again interesting here; for instance, on the more interesting network processors, memory accesses were asynchronous, and instead of a "cache" you had a manually addressed "scratch" memory with latency comparable to registers.

... because "general purpose cache" wasn't helpful, but "exploitable locality" definitely was.


That's pretty close to how the first NVIDIA GPGPU (the G80) worked. There was no cache, only different kinds of memory, one of which was as fast as registers. Optimal access to RAM was only possible with a limited set of access patterns, so "caching" had to be managed carefully by hand.

In later GPUs they switched (mainly due to programmer laziness :-) ) to a more CPU-like, automatically managed general-purpose cache. The problem was that in most cases, being able to quickly write code that is reasonably fast mattered more to developers than being able to squeeze out every cycle (though I believe it's still possible to disable the cache and manage it manually...).

The problem with an automatic cache is the same as with all "intelligent" CPU features: the CPU tries to predict what the program is doing, so the programmer has to predict what the CPU will predict in order to optimize for that CPU. In the next generation, the CPU vendor will optimize again for the programs currently around. In the end, optimizing becomes much more complex.

There's certainly an advantage to keeping the logic "dumb" and simply having the software manage everything.


veyron points this out: http://www.fixnetix.com/articles/display/129/

Personally, I think specialized hardware is a bad idea.


- The bulk of them you'll process are <= 4 letters (CMCSA/CMCSK be damned).

With 5 bits per letter, the bulk of stock tickers fit in 20 bits, so the data structure could just be one big flat array: 2^20 slots times a per-symbol record rounded up from 109 to 128 bytes comes to exactly 128 MB.
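
A minimal sketch of that scheme in Python, assuming the encoding A=1..Z=26 with 0 reserved for padding:

    # 20-bit direct-index scheme: pack up to four letters at 5 bits each,
    # then index a flat array of 128-byte records.
    def pack20(symbol):
        assert len(symbol) <= 4 and symbol.isalpha() and symbol.isupper()
        index = 0
        for ch in symbol:
            index = (index << 5) | (ord(ch) - ord("A") + 1)
        return index                     # always < 2**20

    RECORD_SIZE = 128
    records = bytearray((1 << 20) * RECORD_SIZE)   # the one big 128 MB array

    def record_offset(symbol):
        return pack20(symbol) * RECORD_SIZE

    print(hex(pack20("IBM")))   # distinct symbols get distinct indices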


This sounds like the beginning of a fun project. However, may I humbly suggest a few things?

(In 2005, I built a system to do interesting computations on each stock option tick. The volume was pretty high for the time: each time IBM stock, for example, moved a penny, it generated 400 quotes. While small by today's volumes, 100k quotes with an OAS calculation over a binomial matrix on each tick was a challenge. So I am not totally unfamiliar with these throughput issues.)

Python was used, but only on the GUI side, presenting the results. While I didn't measure it, and at the risk of being a premature optimizer, I feel quite confident in saying that Python isn't going to keep up.

Before you optimize the stock look-up function, you might want to look ahead a bit. veyron has commented below or elsewhere that a very important calculation is going to be your total holdings on each tick. This is how you will calculate your risk, or at least the fundamental start.

Another thing to consider is your goal in trading. How will you be making money? For example, will you be buying/selling on, say, 10 second intervals and collect profits from that? Or longer intervals? Shorter?

Whatever your overall strategy function is, you will spend some time in that calculation. Will you run it on every tick? At this point its cost may well be unknown, and if it is an interesting calculation, it is probably complex.

So how is a startup going to deal with this firehose of a problem?

Well, consider splitting up the processing. One obvious way is to filter the tickers you are interested in on one box and push a much more manageable stream to your main logic box. Perhaps start with one frequently traded stock and tune your algorithms. This will be a rich learning experience.
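
A minimal sketch of that filter box; the addresses, ports, and byte offsets are invented placeholders, not a real ITCH setup:

    # Filter box: subscribe to the full feed, forward only watched symbols.
    import socket

    WATCHLIST = {b"IBM", b"MSFT"}
    FEED_ADDR = ("0.0.0.0", 26400)            # placeholder feed port
    STRATEGY_ADDR = ("192.168.1.20", 26401)   # placeholder logic box

    feed = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    feed.bind(FEED_ADDR)
    out = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    while True:
        msg = feed.recv(2048)
        symbol = msg[1:9].rstrip()            # pretend ticker sits at bytes 1..8
        if symbol in WATCHLIST:
            out.sendto(msg, STRATEGY_ADDR)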

Then, watch the CPU percentage. If it is above 60% at 0910 central, maybe start profiling your code. Your time-consuming parts will surprise you.

If it is going well, and you want to expand, consider adding one box for each stock that you plan to trade. You can do it for about $200 diskless, quantity one retail. Perhaps put the quote stream or a divided quote stream on a multicast channel, two NICs per box. Allocate a pre-determined amount of capital for each stock so they don't need to coordinate to give you your risk calculation.

Also, the idea of High-Frequency Trading is kind of a fractal concept. On the low end, you have a trading situation only slightly more active than a day trader's. On the high end, you have boys who rewrite the network stacks in Windows or Linux for boxes co-located with exchange points in New Jersey, where the cables are measured to the inch so everybody sees the tick at the same time. Those are the 100-nanosecond boys. These are the fellows who order servers in multiple racks of 30. And they trade in a whole different style: their net hold time might average out to zero, but because of agreements they have with the exchanges, they have to stay in the market when it takes a dive.

So don't try to be them at the start.

In summary, success here will come more from the engineering than from the computer science, and from not trying to beat the big boys at their own game.


I suspect you'll want something a touch faster than O(l) per lookup; in fact, given the earlier point that memory is no problem, you should just use a hashtable.


Exactly. I was amazed at how little ground was covered by the end of the post; it goes off on a yak shave over the part of this problem that is neither hard nor novel.


The hash function will still be O(l) in the length of the key.


I'm happy someone was interested enough to just start coding :)


Here's some trivial YAML-driven Python code: https://bitbucket.org/cdtitch41/itch41/overview

It doesn't construct books or do anything else sophisticated, but it may be of interest to ETL folks.



Thanks for all the comments! Veyron had told me a hashtable would be a better idea. I have been reading about creating a perfect hash so I don't have to deal with collisions, although I'm (obviously) not the best algorithm guy. So that post may be coming soon.

I had thought that, given the extremely low latency required, a hardware solution would be necessary. So that may be the next place to start experimenting.
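
For reference, one simple way to get a collision-free table for a fixed, known-in-advance symbol list is to brute-force a hash salt. A minimal sketch: the table is oversized, so this is not a true minimal perfect hash, and the symbol list is a stand-in:

    # Try salts until every symbol lands in its own slot. For <10k fixed
    # strings a working salt turns up quickly.
    import zlib

    def find_salt(symbols, size):
        for salt in range(1_000_000):
            slots = {zlib.crc32(s.encode(), salt) % size for s in symbols}
            if len(slots) == len(symbols):   # no collisions
                return salt
        raise RuntimeError("no salt found; grow the table")

    symbols = ["A", "IBM", "MSFT", "CMCSA", "CMCSK", "GOOG"]
    size = 4 * len(symbols)                  # oversize to make a salt easy to find
    salt = find_salt(symbols, size)
    table = [None] * size
    for i, s in enumerate(symbols):
        table[zlib.crc32(s.encode(), salt) % size] = i
    print(table[zlib.crc32(b"MSFT", salt) % size])   # -> 2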


Is there a place where we can download end-of-day ITCH data? There are so many HFT/LLT outfits out there with this data; surely one will be generous enough to release it for free or for some trivial amount?


I looked into it: it's expensive to redistribute market data (you have to register with the various exchanges to redistribute their data, and it's not cheap). Though Nasdaq did have some data available on their FTP site: ftp://emi.nasdaq.com/ITCH/



