Anyone can try it, but you need to spend the time and educate yourself, just like in any other profession or hobby. The first step is to go to eBay and search for historical stock market data: you can buy 20 years of data for less than $100 and test all your trading ideas without losing a single penny... The barrier to entry is very low: some Python knowledge, a Linux machine, and the data, and off you go...
Dude, seriously...you have to stop peddling this ebay recommendation. No one is going to be competitive with data purchased on an auction website.
In case anyone is wondering where they should actually get data to be competitive, I can personally speak for these two:
* Nanex NxCore
* CBOE Livevol
I haven't purchased data from here, but I have heard it well recommended from people I trust:
* TickData
* QuantQuote
There are likely others, but bear in mind that the higher the data quality, the higher the price. For most strategies you'll likely want vetted intraday data, preferably at minute resolution or finer. Tick data is better, but it's going to be huge (my drives are near 100TB). You can reformat it into custom bar sizes if you have data on the actual trades/quotes.
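If it helps, here's what that reformatting might look like as a minimal sketch, assuming the trade prints already sit in a pandas DataFrame with a timestamp index plus price and size columns; the column names and sample values below are illustrative, not any vendor's actual format:

    import pandas as pd

    # Hypothetical trade prints: a DatetimeIndex plus 'price' and 'size' columns.
    # These names and values are made up purely for illustration.
    trades = pd.DataFrame(
        {"price": [100.0, 100.1, 99.9, 100.2], "size": [200, 50, 100, 300]},
        index=pd.to_datetime([
            "2020-01-02 09:30:00.1", "2020-01-02 09:30:12.4",
            "2020-01-02 09:31:03.0", "2020-01-02 09:31:45.7",
        ]),
    )

    # Roll raw prints up into 1-minute OHLCV bars; swap "1min" for any bar size.
    bars = trades["price"].resample("1min").ohlc()
    bars["volume"] = trades["size"].resample("1min").sum()
    print(bars)

The same idea extends to quotes if you carry bid/ask columns through the resample instead of trade prices.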
My point is that this data is not competitive. If you give me an example of a strategy you use with that data, I'll illustrate the issues with using data at the resolution the eBay listings sell (to say nothing of things like survivorship bias...).
If you insist on data at this price point, why not just scrape Yahoo Finance? There are GitHub projects that will do this for you.
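For anyone curious, one widely used example of such a project is yfinance (a community wrapper around Yahoo Finance, not affiliated with Yahoo); a minimal sketch, with an arbitrary ticker and date range:

    import yfinance as yf  # pip install yfinance

    # Pull ~20 years of daily OHLCV for one ticker. Free sources like this carry
    # survivorship bias and the occasional bad print, so treat the result as a
    # starting point rather than vetted research data.
    data = yf.download("SPY", start="2004-01-01", end="2024-01-01", auto_adjust=True)
    print(data.tail())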
Please, if you'd like to help the community, just share a good data source... I would never share my winning strategy with some "know it all" on a public forum :-)
It doesn't need to be your actual strategy, whatever. I have no desire to "steal" whatever your system is.
What I am trying to demonstrate here is that your strategy can perform significantly differently depending on the time resolution of the data it uses. If you find an inefficiency that can be traded by analyzing snapshot data at 1-minute bars, it might turn out to be nothing at all when you look at the tick data. If your strategy is profitable but you don't have real insight into the trades versus just the quotes, you might have unrealistic expectations for fills, hold times, etc. And if your strategy is profitable, your data is the first tool you have for showing that the profits aren't just correlated with the broader market's regime.
Thanks, but you don't have to prove anything to me... my trading account is my proof that I don't have to compete with the flash boys to make money, so I don't care about tick data...
Alright, could you help me understand what type of data it is that you purchased from ebay? And what type of strategy you're using, at the highest level? Are you trading equities? Derivatives? Can you at least tell me how long you typically hold a position?
I'm not interested in your actual strategy, I just want to understand what you're doing categorically to see if there's any way it can possibly work instead of being attributable to a bull market. You should be able to talk about the type of strategy this is without discussing the actual signal(s) key to the strategy. This is just becoming Kafkaesque for me. I'm legitimately shocked no one else has asked you about this just out of basic curiosity about how you arrived at your current methodology. You mentioned Flash Boys and tick data...but HFT is not the only paradigm for which you'd use tick data.
I'm trying to give you the benefit of the doubt here, but from my experience none of what you're doing makes sense. You have to understand that the reason I'm giving you so much flak is that you're repeatedly recommending people buy financial data from eBay, you're being cagey about why your recommendation is sound, and your recommendation is utterly alien to the way people professionally trade using historical data.
If your data from eBay is actually, somehow, reliable, then fine; but if it isn't, recommending it to people who are looking for actionable data is irresponsible.
You can come up with "long term" (hold times of weeks, I guess) trading ideas with nothing more than close prices, right? On the assumption that you only ever trade the auction, you'd be getting a reasonable approximation of your actual experience.
I mean, I'm not sure I accept the premise that there's enough information there to find significant alpha - but if you assume that he's using a different data source for that, he can test the trading performance on close prices alone.
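To illustrate what "close prices alone" testing can look like, here's a minimal sketch of a moving-average crossover evaluated on daily closes; it assumes execution at the next close and ignores costs and slippage, which is exactly the kind of simplification being debated above:

    import pandas as pd

    def ma_crossover_backtest(close: pd.Series, fast: int = 50, slow: int = 200) -> pd.Series:
        """Long when the fast MA is above the slow MA, flat otherwise.

        The signal is lagged one day so each position is assumed to be entered
        at the next close after the signal fires, a crude stand-in for real fills.
        """
        fast_ma = close.rolling(fast).mean()
        slow_ma = close.rolling(slow).mean()
        position = (fast_ma > slow_ma).astype(float).shift(1).fillna(0.0)
        strategy_returns = position * close.pct_change().fillna(0.0)
        return (1.0 + strategy_returns).cumprod()  # equity curve starting at 1.0

    # Usage: equity_curve = ma_crossover_backtest(daily_data["Close"])

Whether such a test says anything about live performance is, of course, the point of contention.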
Good morning dsacco, I really like your way of thinking based on your other comments, but I can't understand why you are so against cheap or free (if we can find it) data. Twenty years of daily OHLCV data covering all listed and delisted symbols is perfect for someone who is just starting to test trading ideas like trend following, and most importantly it will encourage them to think about their money!
Your alternative, asking people to pay $3,000 or more, will discourage all of them and leave them as victims of financial advisors, fintech startups, internet gurus, and buy-and-hold evangelists. That is why I'll keep promoting the $100 eBay data until I find a cheaper or free version; maybe you can help the community here?
If you are going to use machine learning on the data, though, make sure you know what you are doing unless you are just using someone else's complete package. It's really easy to screw up machine learning.
I recall an example given in a class I took. (I may be misremembering the details, though).
Some people were trying to apply machine learning to currency trading. They had a bunch of data. They normalized the data (a common step in machine learning), divided it into training, test, and validation sets, and trained their model. Everything looked great, and they were getting excellent results on the test set.
When they went live with real money instead of making the nice profit predicted, they lost a lot of money.
Their mistake? They normalized the whole data set up front. What they should have done is split the data first, compute the normalization parameters from the training set alone, and then apply those same parameters to the test and validation sets. Normalizing before splitting leaks statistics of the held-out data into training, compromising the independence of the sets and biasing the evaluation.
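To make the failure mode concrete, here's a minimal sketch on synthetic data (scikit-learn assumed; none of this is from the original anecdote) contrasting the two orders of operations, fitting the scaler on the entire history versus fitting it on the training window only and reusing those parameters on the held-out data:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    prices = rng.normal(0.1, 1.0, size=1000).cumsum() + 100  # synthetic drifting series
    X = prices.reshape(-1, 1)

    split = int(0.7 * len(X))
    X_train, X_test = X[:split], X[split:]

    # Wrong: the mean/std are computed using future observations, so the
    # "past" rows are scaled with knowledge of where the series ends up.
    leaky = StandardScaler().fit(X)
    X_train_leaky, X_test_leaky = leaky.transform(X_train), leaky.transform(X_test)

    # Right: fit on the training window only, then apply those same parameters
    # to the test window, exactly as you would have to do when trading live.
    clean = StandardScaler().fit(X_train)
    X_train_clean, X_test_clean = clean.transform(X_train), clean.transform(X_test)

    print("leaky scaler mean:", leaky.mean_[0])
    print("clean scaler mean:", clean.mean_[0])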
(I must admit I never did quite understand this. Normalizing is optional. As far as I understand if one does an arbitrary transformation on one's data as a whole that should not actually make learning go bad, at least as long as the same transformation is applied to all the input when you go live. So when they did a normalizing step on the whole data set, why wasn't that just like doing any other arbitrary transform? There is serious dark magic here...)
I've read about a similar scenario (possibly the same one?), and you got the important bits right.
You cannot normalize the entire dataset at once; the normalization carries information about the whole series, effectively allowing the algorithm to cheat and see the future... but in real life we can't see the future.
The simple example that eliminates some of the "black magic" is that normalizing against the entire set lets the algorithm know what the highest and lowest points across the entire data set are - and knowing the lowest and highest lets the algorithm buy low/sell high for all of the known data.
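A tiny numeric illustration of that point, with made-up prices: scale a series by its global min/max before splitting, and every early value already encodes how high the series will eventually go.

    # Made-up daily closes; the last price is the all-time high.
    prices = [10.0, 12.0, 11.0, 13.0, 20.0]
    lo, hi = min(prices), max(prices)

    # Scaling with the global min/max expresses each early value relative to a
    # maximum (20.0) that only occurs in the future.
    scaled = [(p - lo) / (hi - lo) for p in prices]
    print(scaled)  # [0.0, 0.2, 0.1, 0.3, 1.0]

A model trained on the early, already-scaled values is implicitly told that those prices sit well below a peak it has not "seen" yet.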
Without a detailed understanding of what's involved, I'd guess that normalizing is a non-local operation, in the sense that the effect of normalizing some x depends on the other elements of your dataset, so the normalization implicitly includes "information" from the rest of the data.
In real life, that's sort of like having access to the answers when you're given the questions. Being trained in that environment won't do much good.
I think you've got it. The part about normalization across the entire dataset implicitly including information is true by definition (it contains, at a minimum, the true min/max). Can't speak to the rest, though, as I'm still learning myself... :)
1. Almost all buyers are institutional funds or individuals with the means to trade as well-informed investors. To put it succinctly, they treat it as serious business, because it's incredibly expensive. They have no incentive to make the edge they just purchased for five to six figures public.
2. These vendors go to various lengths to protect the data, including steganographic "trap streets" to identify the account the data belongs to.
3. There are significant legal liabilities inherent to releasing the data without a redistribution license. You agree to these explicitly by buying the data.
Sorry, I should have clarified my question. Why aren't the exchanges themselves publishing the data onto torrents?
My layman's understanding is that the more actors participate on the exchanges, the more volume, and the more the exchanges profit. If this is true, then making the historical trading data freely available would encourage more entrants onto the exchanges, and the exchanges would earn far more than they do by selling the data to a few large institutions.
I think it's not like there are tons of qualified people with a great interest in quant-y things whose only barrier to entry is a lack of data. It's sort of a monopoly+monopsony thing (I guess; this is highly likely wrong) where there's no incentive to expand sales further because you're already hitting almost the entire market, and entry to the market isn't a problem for potential buyers.
Actually, Nasdaq makes more money licensing its data than on the rest of its businesses. You can go to each exchange and buy the historical data directly. In my experience this is generally better for historical depth-of-book data than, say, a Bloomberg terminal, but it requires a lot more upfront work (parsing, normalizing). Additionally, any non-HFT trader who handles institutional trading needs to pay a small fee (roughly $20/month per exchange) for each system that requires real-time data. Also, little-known fact: no retail trading orders hit the actual exchange. Even institutional investors who are not brokers (like most hedge funds) must execute through a broker/EMS.
I have a hard time seeing how it could be copyrighted, at least if we are talking about comprehensive stock listings organized in the obvious way (e.g., a table of prices organized by date, where the stocks included are chosen by some straightforward criteria).
In the United States I'd expect this to be covered by Feist Publications, Inc., v. Rural Telephone Service Co., 499 U.S. 340 (1991).
It's considered a "compilation". Feist addresses this in the links you supplied. However, if you're not used to reading law, this might help explain it in normal English. (Court documents are most definitely not normal English!)
That was why I threw in the mentions of comprehensiveness, obvious organization, and obvious selection. Feist requires at least a minimum degree of creativity in the organization and selection for a compilation to be copyrightable.
A compilation whose selection criterion is, for example, "everything traded on NASDAQ" and whose organization is "ordered by date" seems unlikely to me to have sufficient creativity to qualify.
I trade options using Livevol data. Good historical options data doesn't come cheaper than about $3,000/year (I'm willing and grateful to be proven wrong, but I've looked).