Anyone can try it, but you need to spend the time and educate yourself, just like in any other profession or hobby. The first step is to go to eBay and search for historical stock market data: you can buy 20 years of data for less than $100 and test all your trading ideas without losing a single penny... The barrier to entry is very low: some Python knowledge, a Linux machine, and the data, and off you go...
Dude, seriously...you have to stop peddling this ebay recommendation. No one is going to be competitive with data purchased on an auction website.
In case anyone is wondering where they should actually get data to be competitive, I can personally speak for these two:
* Nanex NxCore
* CBOE Livevol
I haven't purchased data from here, but I have heard it well recommended from people I trust:
* TickData
* QuantQuote
There are likely others, but bear in mind that the higher the data quality, the higher the price. For most strategies you'll likely want vetted intraday data, preferably at minute resolution or finer. Tick data is better, but it's going to be huge (my drives are near 100TB). You can reformat it into custom bar sizes if you have data on the actual trades/quotes.
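If it helps, here's what that reformatting might look like as a minimal sketch, assuming the trade prints already sit in a pandas DataFrame with a timestamp index plus price and size columns; the column names and sample values below are illustrative, not any vendor's actual format:

    import pandas as pd

    # Hypothetical trade prints: a DatetimeIndex plus 'price' and 'size' columns.
    # These names and values are made up purely for illustration.
    trades = pd.DataFrame(
        {"price": [100.0, 100.1, 99.9, 100.2], "size": [200, 50, 100, 300]},
        index=pd.to_datetime([
            "2020-01-02 09:30:00.1", "2020-01-02 09:30:12.4",
            "2020-01-02 09:31:03.0", "2020-01-02 09:31:45.7",
        ]),
    )

    # Roll raw prints up into 1-minute OHLCV bars; swap "1min" for any bar size.
    bars = trades["price"].resample("1min").ohlc()
    bars["volume"] = trades["size"].resample("1min").sum()
    print(bars)

The same idea extends to quotes if you carry bid/ask columns through the resample instead of trade prices.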
My point is that this data is not competitive. If you give me an example of a strategy you use with that data, I'll illustrate the issues with using data at the resolution the eBay listings sell (to say nothing of things like survivorship bias...).
If you insist on data at this price point, why not just scrape Yahoo Finance? There are GitHub projects that will do this for you.
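For anyone curious, one widely used example of such a project is yfinance (a community wrapper around Yahoo Finance, not affiliated with Yahoo); a minimal sketch, with an arbitrary ticker and date range:

    import yfinance as yf  # pip install yfinance

    # Pull ~20 years of daily OHLCV for one ticker. Free sources like this carry
    # survivorship bias and the occasional bad print, so treat the result as a
    # starting point rather than vetted research data.
    data = yf.download("SPY", start="2004-01-01", end="2024-01-01", auto_adjust=True)
    print(data.tail())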
Please, if you'd like to help the community, just share a good data source... I would never share my winning strategy with some "know it all" on a public forum :-)
It doesn't need to be your actual strategy, whatever. I have no desire to "steal" whatever your system is.
What I am trying to demonstrate here is that your strategy can perform significantly differently depending on the time resolution of the data it uses. If you find an inefficiency that can be traded by analyzing snapshot data at 1-minute bars, it might turn out to be nothing at all when you look at the tick data. If your strategy is profitable but you don't have real insight into the trades versus just the quotes, you might have unrealistic expectations for fills, hold times, etc. And if your strategy is profitable, your data is the first tool you have for showing that the profits aren't just correlated with the broader market's regime.
Thanks, but you don't have to prove anything to me... my trading account is my proof that I don't have to compete with the flash boys to make money, so I don't care about tick data...
Alright, could you help me understand what type of data it is that you purchased from ebay? And what type of strategy you're using, at the highest level? Are you trading equities? Derivatives? Can you at least tell me how long you typically hold a position?
I'm not interested in your actual strategy, I just want to understand what you're doing categorically to see if there's any way it can possibly work instead of being attributable to a bull market. You should be able to talk about the type of strategy this is without discussing the actual signal(s) key to the strategy. This is just becoming Kafkaesque for me. I'm legitimately shocked no one else has asked you about this just out of basic curiosity about how you arrived at your current methodology. You mentioned Flash Boys and tick data...but HFT is not the only paradigm for which you'd use tick data.
I'm trying to give you the benefit of the doubt here, but from my experience none of what you're doing makes sense. You have to understand that the reason I'm giving you so much flak is that you're repeatedly recommending people buy financial data from eBay, you're being cagey about why your recommendation is sound, and your recommendation is utterly alien to the way people professionally trade using historical data.
If your data from eBay is actually, somehow, reliable, then fine; but if it isn't, recommending it to people who are looking for actionable data is irresponsible.
You can come up with "long term" (hold times of weeks, I guess) trading ideas with nothing more than close prices, right? On the assumption that you only ever trade the auction, you'd be getting a reasonable approximation of your actual experience.
I mean, I'm not sure I accept the premise that there's enough information there to find significant alpha - but if you assume that he's using a different data source for that, he can test the trading performance on close prices alone.
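To illustrate what "close prices alone" testing can look like, here's a minimal sketch of a moving-average crossover evaluated on daily closes; it assumes execution at the next close and ignores costs and slippage, which is exactly the kind of simplification being debated above:

    import pandas as pd

    def ma_crossover_backtest(close: pd.Series, fast: int = 50, slow: int = 200) -> pd.Series:
        """Long when the fast MA is above the slow MA, flat otherwise.

        The signal is lagged one day so each position is assumed to be entered
        at the next close after the signal fires, a crude stand-in for real fills.
        """
        fast_ma = close.rolling(fast).mean()
        slow_ma = close.rolling(slow).mean()
        position = (fast_ma > slow_ma).astype(float).shift(1).fillna(0.0)
        strategy_returns = position * close.pct_change().fillna(0.0)
        return (1.0 + strategy_returns).cumprod()  # equity curve starting at 1.0

    # Usage: equity_curve = ma_crossover_backtest(daily_data["Close"])

Whether such a test says anything about live performance is, of course, the point of contention.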
Good morning dsacco, I really like your way of thinking based on your other comments, but I can't understand why you are so against cheap or free (if we can find it) data. Twenty years of daily OHLCV data covering all listed and delisted symbols is perfect for someone who is just starting to test trading ideas like trend following, and most importantly it will encourage them to think about their money!
Your alternative, asking people to pay $3,000 or more, will discourage all of them and leave them as victims of financial advisors, fintech startups, internet gurus, and buy-and-hold evangelists. That is why I'll keep promoting the $100 eBay data until I find a cheaper or free version; maybe you can help the community here?
If you are going to use machine learning on the data, though, make sure you know what you are doing unless you are just using someone else's complete package. It's really easy to screw up machine learning.
I recall an example given in a class I took. (I may be misremembering the details, though).
Some people were trying to apply machine learning to currency trading. They had a bunch of data. They normalized the data (a common step in machine learning), divided it into training, test, and validation sets, and trained their model. Everything looked great, and they were getting excellent results on the test set.
When they went live with real money instead of making the nice profit predicted, they lost a lot of money.
Their mistake? They normalized the whole data set up front. What they should have done is split the data first, compute the normalization parameters from the training set alone, and then apply those same parameters to the test and validation sets. Normalizing before splitting leaks statistics of the held-out data into training, compromising the independence of the sets and biasing the evaluation.
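To make the failure mode concrete, here's a minimal sketch on synthetic data (scikit-learn assumed; none of this is from the original anecdote) contrasting the two orders of operations, fitting the scaler on the entire history versus fitting it on the training window only and reusing those parameters on the held-out data:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    prices = rng.normal(0.1, 1.0, size=1000).cumsum() + 100  # synthetic drifting series
    X = prices.reshape(-1, 1)

    split = int(0.7 * len(X))
    X_train, X_test = X[:split], X[split:]

    # Wrong: the mean/std are computed using future observations, so the
    # "past" rows are scaled with knowledge of where the series ends up.
    leaky = StandardScaler().fit(X)
    X_train_leaky, X_test_leaky = leaky.transform(X_train), leaky.transform(X_test)

    # Right: fit on the training window only, then apply those same parameters
    # to the test window, exactly as you would have to do when trading live.
    clean = StandardScaler().fit(X_train)
    X_train_clean, X_test_clean = clean.transform(X_train), clean.transform(X_test)

    print("leaky scaler mean:", leaky.mean_[0])
    print("clean scaler mean:", clean.mean_[0])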
(I must admit I never did quite understand this. Normalizing is optional. As far as I understand if one does an arbitrary transformation on one's data as a whole that should not actually make learning go bad, at least as long as the same transformation is applied to all the input when you go live. So when they did a normalizing step on the whole data set, why wasn't that just like doing any other arbitrary transform? There is serious dark magic here...)
I've read about a similar scenario (possibly the same one?), and you got the important bits right.
You cannot normalize the entire dataset at once; the normalization carries information about the whole series, effectively allowing the algorithm to cheat and see the future... but in real life we can't see the future.
The simple example that eliminates some of the "black magic" is that normalizing against the entire set lets the algorithm know what the highest and lowest points across the entire data set are - and knowing the lowest and highest lets the algorithm buy low/sell high for all of the known data.
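A tiny numeric illustration of that point, with made-up prices: scale a series by its global min/max before splitting, and every early value already encodes how high the series will eventually go.

    # Made-up daily closes; the last price is the all-time high.
    prices = [10.0, 12.0, 11.0, 13.0, 20.0]
    lo, hi = min(prices), max(prices)

    # Scaling with the global min/max expresses each early value relative to a
    # maximum (20.0) that only occurs in the future.
    scaled = [(p - lo) / (hi - lo) for p in prices]
    print(scaled)  # [0.0, 0.2, 0.1, 0.3, 1.0]

A model trained on the early, already-scaled values is implicitly told that those prices sit well below a peak it has not "seen" yet.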
Without a detailed understanding of what's involved, I'd guess that normalizing is a non-local operation, in the sense that the effect of normalizing some x depends on the other elements of your dataset, so the normalization implicitly includes "information" from the rest of the data.
In real life, that's sort of like having access to the answers when you're given the questions. Being trained in that environment won't do much good.
I think you've got it. The part about normalization across the entire dataset implicitly including information is true by definition (it contains, at a minimum, the true min/max). Can't speak to the rest, though, as I'm still learning myself... :)
1. Almost all buyers are institutional funds or individuals with the means to trade as well-informed investors. To put it succinctly, they treat it as serious business, because it's incredibly expensive. They have no incentive to make the edge they just purchased for five to six figures public.
2. These vendors go to various lengths to protect the data, including steganographic "trap streets" to identify the account the data belongs to.
3. There are significant legal liabilities inherent to releasing the data without a redistribution license. You agree to these explicitly by buying the data.
Sorry, I should have clarified my question. Why aren't the exchanges themselves publishing the data onto torrents?
My layman's understanding is that the more actors participate on the exchanges, the more volume, and the more the exchanges profit. If this is true, then making the historical trading data freely available would encourage more entrants onto the exchanges, and the exchanges would earn far more than they do by selling the data to a few large institutions.
I think it's not like there are tons of qualified people with a great interest in quant-y things whose only barrier to entry is a lack of data. It's sort of a monopoly+monopsony thing (I guess; this is highly likely wrong) where there's no incentive to expand sales further because you're already hitting almost the entire market, and entry to the market isn't a problem for potential buyers.
Actually, Nasdaq makes more money licensing its data than on the rest of its businesses. You can go to each exchange and buy the historical data directly. In my experience this is generally better for historical depth-of-book data than, say, a Bloomberg terminal, but it requires a lot more upfront work (parsing, normalizing). Additionally, any non-HFT trader who handles institutional trading needs to pay a small fee (roughly $20/month per exchange) for each system that requires real-time data. Also, little-known fact: no retail trading orders hit the actual exchange. Even institutional investors who are not brokers (like most hedge funds) must execute through a broker/EMS.
I have a hard time seeing how it could be copyrighted, at least if we are talking about comprehensive stock listings organized in the obvious way (e.g., a table of prices organized by date, where the stocks included are chosen by some straightforward criteria).
In the United States I'd expect this to be covered by Feist Publications, Inc., v. Rural Telephone Service Co., 499 U.S. 340 (1991).
It's considered a "compilation". Feist addresses this in the links you supplied. However, if you're not used to reading law, this might help explain it in normal English. (Court documents are most definitely not normal English!)
That was why I threw in the mentions of comprehensiveness, obvious organization, and obvious selection. Feist requires at least a minimum degree of creativity in the organization and selection for a compilation to be copyrightable.
A compilation whose selection criterion is, for example, "everything traded on NASDAQ" and whose organization is "ordered by date" seems unlikely to me to have sufficient creativity to qualify.
I trade options using Livevol data. Good historical options data doesn't come cheaper than about $3,000/year (I'm willing and grateful to be proven wrong, but I've looked).