Twitter Can Predict The Stock Market (wired.com)
80 points by pmorici on Jan 25, 2011 | 53 comments



Why publish this if it works? One article, a fleeting bit of fame, and a truckload of copycats. It would be more impressive if the article had read: "Researchers, upon discovering tweets predict the stock market, make $100mm before disclosing research to the public." No more need for university grants, and much more believable findings. As an aside, data analysis can be tricky. I'm pretty wary of loosely defined research objectives. For example, why is it 3 days? Why is it those 72 words? Overfitting is a real problem with prediction-based work.


I've always been fascinated by the romantic idea of writing my own trading engine.

So I did some research, and most people who have written them will tell you that in cases like this, a model trained on past data doesn't carry over well to current and future data.

The stock market of 2011 is not the market of 2008.

But what do I know, not like I've actually done it :)


A good place to start: http://www.collective2.com/

You can rent your trading strategies to others, or rent someone else's strategy.

Also some good info/tools regarding automation.


I haven't looked closely at this site, but it seems like it is almost certain to devolve into a textbook example of adverse selection.

If you have a good strategy, you won't rent it out, you'll trade it. Why risk others frontrunning you? Of course, you might post a historically good strategy and front run it. Or they might just be risky strategies, which look good for a short time (encouraging people to rent them), but which carry catastrophic risks the creator doesn't want to take on.

I can't see a single reason why someone would post a good strategy here.


(Preface: I pretty much know nothing about stock markets and trading.)

Isn't there a whole industry that revolves around paying people to provide you with their trading strategies? How is that considerably different than what's happening on Collective2?

Having a good strategy doesn't mean you have the money to actually trade on it. Or that you can slowly build up your trading bankroll using the same strategy.

Then there's strategies that only yield modest returns. Why not make some money on top of that by renting it out? If you let a dozen people use your strategy, does acting on that information give you much of an advantage? I would guess that depends on how much money those people are trading on your strategy.

I would also guess there aren't many big players renting strategies on Collective2. It's an interesting concept, and I think the fact that they've been active since 2003 somewhat validates the idea.

The best thing about it seems to be the ease of using an automated trading agent. I don't know how easy it is to do that elsewhere, but one reason to put your strategy on Collective2 (I'm guessing you can keep it private) would be to use their automation facilities.


There are several whole industries revolving around paying people for help with trading strategies. Most have very specialized economics.

A hedge fund requires capital to operate, and the owners can't necessarily cover fixed costs (salaries, etc.) with their own personal capital. I'd be surprised if many of the strategies on Collective2 fit this model.

Investment advisers often fine tune a strategy to match your personal risks - i.e., help Southwest Airlines hedge their exposure to gas prices, or Apple to hedge their exposure to the RMB. Since Southwest is already short oil due to being an airline, the trading strategy of going long evens them out. It wouldn't make sense for me to trade this strategy, since I don't have an intrinsic short position in oil (plus the alpha in Southwest's strategy comes from selling flights, not oil).

> If you let a dozen people use your strategy, does acting on that information give you much of an advantage?

Buy $10k of some low volume stock. Have a few other people pile on and buy the same stock (after you). The price will go up a few cents. Then you sell, probably to the same people buying from you. This is called frontrunning. If you didn't frontrun, you bear the risk that one of your renters would buy the shares before you do, thereby driving up the price before you purchase it. Less of an issue with GOOG, admittedly.


While I agree with your points, I think it is possible there are trading strategies that are defined by needing a certain amount of upfront capital (say $1 million). So, you might know it is a winning strategy after a couple of renters use it successfully, but need to rent it out until you have enough money to use it yourself.


That's the premise of a hedge fund. A strategy generating 9% alpha, but which requires $1 million in fixed costs to manage (office space, salary, market data), might require $30-100 million in operating capital.
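For a rough sense of the arithmetic (my own assumption that fixed costs get covered out of a conventional 1-3% management fee, not something stated above):

    # Back-of-the-envelope: capital needed for a management fee to cover fixed costs.
    # The 1-3% fee range is an assumption, not a figure from this thread.
    fixed_costs = 1_000_000  # annual office space, salaries, market data

    for fee_rate in (0.01, 0.02, 0.03):
        required_aum = fixed_costs / fee_rate
        print(f"{fee_rate:.0%} fee -> ~${required_aum:,.0f} under management")
    # 1% -> $100M, 2% -> $50M, 3% -> roughly $33M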

It's hard to see how those economics would apply to collective2.com.



Correction--it worked. The authors chose a two-year old sample during which the Dow Jones fell 30.7%. I seriously doubt this will have any predictive power outside of that sample.


Fitting a model to "predict" events that have already happened isn't anywhere near as hard as actually predicting events that have yet to happen.


I'm not so sure about that. It's pretty hard to predict the past too.

Anyone doing serious research into this will first partition the past data into training and test sets (and sometimes other validation sets).

So the idea would be to fit a model on one set of past data (the 'training' set), check that it works, and then, in the final evaluation, run it on the never-seen-before, never-used, never-thought-about 'test' data.

If you have a model trained on 2009, and it also does a great job the first time you run it on the Q1 2010 data that you've never looked at before, I'm now interested, even though every data point is in the past.

I imagine they had to do something like this to pass review.
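A minimal sketch of that kind of chronological split, in Python (the file name, column names and dates are placeholders, not the paper's actual setup):

    import pandas as pd

    # Hypothetical daily data: a sentiment feature plus next-day DJIA direction.
    df = pd.read_csv("sentiment_and_djia.csv", parse_dates=["date"]).sort_values("date")

    # Chronological split: fit on the earliest data, tune on a held-out slice,
    # and only touch the final test window once, at the very end.
    train = df[df["date"] < "2009-10-01"]                            # fit the model here
    validation = df[df["date"].between("2009-10-01", "2009-12-31")]  # model selection
    test = df[df["date"] >= "2010-01-01"]                            # evaluate exactly once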


> If you have a model trained on 2009, and it also does a great job the first time you run it on the Q1 2010 data that you've never looked at before, I'm now interested, even though every data point is in the past.

So, you have a model trained on 2009. You try it on the Q1 2010 data and...it doesn't work. Damn. So you throw it out, go back to the drawing board, and try again. And again. And ag...hey, this one works! Trained on 2009 data, and it predicts Q1 2010 perfectly!

Do you trust this model to predict Q2 2010?


Obviously if there has been a 'meta' process of refinement, such that the test set has been used in model development, then it's not a clean test set any more and shouldn't be regarded as such. That's something for researchers to watch out for and make sure they don't do. And good researchers are well aware of these pitfalls.

That's why I mentioned the validation sets, and that the test set must never have been looked at, or used before.

But the point stands - if the method works on a clean test set, even if the test set is in the past, then it should be taken seriously.

Would I trust such a model to predict the stock market in Q2 2010? No, because my prior belief is that the stock market is very hard to predict, so I would need very strong evidence to the contrary. But that has nothing to do with having confidence in models that have been tested on historical data.


Right, and they do this for a period from February 2008 to December 2008. We do the same here at the Federal Reserve when developing models.

Sure, the data is historical, but your model doesn't distinguish between "old" and "new" data. If your model predicts the held-out test data well (out-of-sample forecasting), then you have something interesting.


Right, that was my point. It worked for a specially-selected sample during which a bunch of correlated macro factors existed. I commented a bit lower with more details.


I have seen the study a few times, most recently in http://news.ycombinator.com/item?id=1803505 . I think the big problem is selection bias:

The Dow Industrial Average over the last 10 years

http://www.google.com/finance?chddm=997050&q=INDEXDJX:.D...

* Notice that the end of 2008 was unusual for the index: 2008 had the most herded and fearful stock market in recent history. If at any time the stock market was correlated with mood, it would have been then. I am not sure a 2008 analysis can be generalized to any year but 2008.

* They have not done an analysis on 2009 or 2010, and they chose to split the analysis and single out December 2008 based on a qualitative assumption about the "stabilization of DJIA values after considerable volatility in previous months and the absence of any unusual or significant sociocultural events". December 2008 was still very much in the midst of the crisis.

* For their December "stable" data set, they only used 30 days, which is a small sample. There is a big pool to draw from since 2009, as the market has been relatively stable.


Right, upvoted. I read the paper and the authors are very particular in their sample selection. How could someone choose 2008 as a sample? I'd be much more impressed if they used a larger sample.

Also, some food for thought: it would be interesting to see someone testing Twitter moods as an instrumental variable for a project.


I may be wrong, but an initial success rate of 73.3% before adding the emotional data seems like overfitting.


Depending on how they defined success, 73% might be achievable just by looking at co-correlation. Up days and down days tend to run in streaks. If they defined success as predicting 'up' or 'down' for the day for the DJIA, just going with the most popular result for the last N days could work.

But overfitting is definitely still a concern. Looking at the overall trend for the Dow Jones in 2008, I wonder what the success rate of an indicator that always said 'down' would be.
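A quick sketch of those two baselines on synthetic data (the drift, N and sample length are arbitrary choices of mine); with real 2008 DJIA closes substituted in, these are the baselines any 73% claim would need to beat:

    import numpy as np

    rng = np.random.default_rng(0)
    # Synthetic daily up/down moves with a mild downward drift, a stand-in for 2008.
    moves = np.sign(rng.normal(loc=-0.1, scale=1.0, size=250))

    N = 3
    majority_hits = down_hits = 0
    for t in range(N, len(moves)):
        majority_pred = 1.0 if moves[t - N:t].sum() > 0 else -1.0  # most common recent direction
        majority_hits += majority_pred == moves[t]
        down_hits += moves[t] == -1.0                              # "always down" baseline

    print("majority-of-last-N accuracy:", majority_hits / (len(moves) - N))
    print("always-down accuracy:       ", down_hits / (len(moves) - N))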


> 73% might be achievable just by looking at co-correlation. Up days and down days tend to run in streaks.

If that were true, there would be an exceptionally easy way to make money: Buy a future or option today based on yesterday's move. Leverage ad infinitum.

Random coin flips also tend to run in streaks, btw - in a few thousand throws, you'll probably have several 10 "head" streaks and several 10 "tail" streaks.


> there would be an exceptionally easy way to make money: Buy a future or option today based on yesterday's move. Leverage ad infinitum.

This kind of co-correlation and general market direction is already baked into the option and futures prices. Also, they don't tend to fluctuate as much from day to day, since their prices reflect what the value will be on the contract delivery date, not what the price will be tomorrow. I'm not clear on what profit opportunity you're seeing.

> Random coin flips also tend to run in streaks, btw - in a few thousand throws, you'll probably have several 10 "head" streaks and several 10 "tail" streaks.

And in a flat market, that's often the behavior you see. When the market starts trending, though, the coin starts acting 'rigged', and streaks in the prevailing market direction tend to become longer.


I do not know what this "co-correlation" that you speak of is, and google doesn't seem to either. Assuming you are speaking about day-to-day correlation:

I don't know what futures you were thinking of, but financial futures (single stock, index futures, currency futures) track the base value EXACTLY (but also taking into account interest rates, dividends, etc). If this weren't the case, there would be an immediate arbitrage opportunity.

Specifically, once you factor the interest rate out, the DJIA future and the DJIA index are in sync within seconds. The HFT traders take care of that.
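For what it's worth, "tracking exactly once you factor out interest rates and dividends" is the standard cost-of-carry relationship; a tiny sketch with made-up numbers:

    import math

    # Cost-of-carry fair value of an index future: F = S * exp((r - q) * T).
    # Spot level, rate, dividend yield and maturity below are illustrative only.
    spot = 11_800.0   # index level
    r = 0.0025        # annualized risk-free rate
    q = 0.025         # annualized dividend yield of the index
    T = 0.25          # years to delivery

    fair_future = spot * math.exp((r - q) * T)
    print(f"fair futures price: {fair_future:.2f}")
    # If the quoted future drifts from this value, arbitrageurs (and HFT) close the gap.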

And while it is true that the market does trend occasionally (more than a coin flip would), timing the start and end of the trend is empirically very hard.

If you know the market is trending, why don't you buy a future betting on the trending direction, with a stop at 2 ticks above and below your entry price? If the market is trending, you have positive winning expectation.

Except the market doesn't work that way - and if you think the market is trending when it isn't, you lose money with this scheme.


The trouble with these financial models is that once they become common knowledge, it's too late. The market absorbs these algorithms into its pricing mechanism and renders no further arbitrage profits.


So can Google. Supposedly Sergey even suggested starting a hedge fund, but it would probably be insider trading if they made their decisions based on the user data.


How could it be insider trading, if they're not doing anything with GOOG?


Companies that have non-public information about companies they trade with are barred from using that information to trade stocks. For example, if company B, which manufactures bullets, orders gunpowder from company G, and company B starts ordering a lot more gunpowder, employees of company G can't buy more stock of company B on the basis of that knowledge. It seems like Google could fall into the category of company G, since it has non-public search terms from other companies.


Using some Google customer's secret information to trade also counts as insider trading. If anything, it's more unethical.


I would need to hear more to be convinced. The fact that they had a large number of signals they were tracking, without a clear rationale for any one of them, is troubling.

Consider a set of random signals; arbitrarily select one as the benchmark. Then from among the rest take the signal that best predicts the daily direction of the benchmark. That signal will likely have much better than 50% accuracy because by definition the worst signal will be around 50% accurate (if it were any less it would have an equally useful inverse correlation).
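That effect is easy to demonstrate by simulation; a sketch (the signal count and sample length are arbitrary, 72 just echoes the word count mentioned above):

    import numpy as np

    rng = np.random.default_rng(1)
    days, n_signals = 250, 72
    benchmark = rng.choice([-1, 1], size=days)            # arbitrary "benchmark" direction
    signals = rng.choice([-1, 1], size=(n_signals, days))

    acc = (signals == benchmark).mean(axis=1)
    acc = np.maximum(acc, 1 - acc)   # take each signal or its inverse, whichever fits better
    print(f"best random signal matches the benchmark {acc.max():.1%} of the time")
    # Typically comfortably above 50%, purely by chance.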


There is a typo in the article: the software they are using is called OpinionFinder, and it can be found here: http://www.cs.pitt.edu/mpqa/opinionfinderrelease/


I skimmed the paper, but I couldn't find much information on how they did the cross-validation (e.g., which dates they trained on and which dates they then tested the prediction on). Also, I do believe that tweet sentiment can predict the stock market, but not on such a large timescale. I would guess that any analyst reading the news could come up with a sentiment estimate at least as good as the Twitter opinion finder. I think the Twitter opinion finder is useful when you want to measure sentiment at a rate higher than humans can manage.


I remember doing some research on this a while ago. Getting any sort of text-based emotional index isn't trivial at all; there are a few barely viable solutions (Google's Prediction API and Bayes-based algorithms), but they aren't really accurate. This has also been tried in the past by startups of TechCrunch fame such as stockmood.com, and all failed miserably. Props to Twitter or anyone else who succeeds at this.
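For reference, the "Bayes-based" approach usually amounts to something like a naive Bayes classifier over bag-of-words features; a minimal sketch with scikit-learn (the toy training set is obviously a placeholder for real labelled tweets):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Toy labelled tweets; a real mood index would need far more (and better) data.
    tweets = ["feeling great about today", "this is awful, so worried",
              "calm and happy", "anxious and sad about everything"]
    labels = ["positive", "negative", "positive", "negative"]

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(tweets, labels)
    print(model.predict(["pretty happy with how things are going"]))  # ['positive']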


I have been doing research on this problem, send me an email if you'd like to connect.


I know very little about trading, but even I can see a whole bunch of red flags here. Firstly, if it has just made it to the news, then it's probably a decade too late to take advantage of. Recently there was a news article about how firms have software that reads and trades on news events, except that this recent 'news' was about something that had already been happening for years. Secondly, Twitter contains a subset of the information already in the market, so it's no surprise that there will be some correlation.

Then: predicting up or down movement of a stock is very vague. At what time scale, with what sort of trades, and with what sort of response times would it be executed? What are its drawdowns like? Does it account for taxes, commission fees, etc.? Next, the use of a complex nonlinear learning model with lots of parameters raises alarm bells - these tend to be very susceptible to noise, trading data is highly correlated, and typical regularization methods often do not suffice. Then there is the whole issue of overfitting in general, and of the data used for training (size, survivorship bias, accounting for splits and so on), which makes the whole thing very hand-wavy. Without information as basic as the rate of return, the stated 83% accuracy is meaningless. As with most things, it's easy to get results that work within the limited and safe confines of academic testing, but actually shipping a working product is another story.
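To put a number on why accuracy alone says little (the payoff profile below is invented): a strategy that calls the direction right 83% of the time still loses money if the wins are small and the misses are large.

    # Invented payoff profile: 83% of trades gain 0.2%, 17% lose 1.5%.
    p_win, win, loss = 0.83, 0.002, -0.015
    expected_return = p_win * win + (1 - p_win) * loss
    print(f"expected return per trade: {expected_return:.4%}")  # about -0.09%, i.e. negative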

There has always been a draw to beating the stock market, and these days there is nothing more romantic than doing so using artificial intelligence! But I think the most important property of any trading strategy is that it be made up of parts that are constantly being swapped out and replaced based on research. You can't just throw a machine learning algorithm at it and consider the job done; the thing will likely only profit for a few microseconds. As an aside, though, I would not be surprised if one of the [anti]spam/virus/botnet or HFT wars one day produces AI.


Edit: removed some points b/c charlief made them more succinctly.

An even better question: is the relationship causal? The researchers use Granger causality analysis to test their hypothesis. Wikipedia tells me this analysis "may produce misleading results when the true relationship involves three or more variables." [2] By definition, Twitter and the DJIA are macro aggregates of a number of factors. How could the researchers apply Granger here?

[1] See Table 1 at http://www.sca.isr.umich.edu/documents.php?c=c

[2] http://en.wikipedia.org/wiki/Granger_causality#Limitations
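For anyone curious what the pairwise test looks like in practice, a sketch with statsmodels on synthetic series (the mood series, lag and coefficient are invented; it tests whether the second column helps predict the first, and the limitation in [2] is precisely that an omitted third series can drive both):

    import numpy as np
    from statsmodels.tsa.stattools import grangercausalitytests

    rng = np.random.default_rng(2)
    n = 300
    mood = rng.normal(size=n)      # stand-in for a daily "calm" score
    djia = rng.normal(size=n)
    djia[3:] += 0.5 * mood[:-3]    # DJIA move depends on mood three days earlier

    # Column order matters: this tests whether column 2 Granger-causes column 1.
    grangercausalitytests(np.column_stack([djia, mood]), maxlag=4)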



I find this interesting because the consensus among the economic community is that markets are highly efficient, that is, information is reflected immediately in stock market prices. This suggests information exists which is not being reflected. That's why I'm skeptical.


>the consensus among the economic community is that markets are highly efficient

Really? There are some true believers out there under this impression, but I didn't think anyone credible was. It wasn't so long ago that someone showed that efficient markets amount to a P=NP problem.

EDIT: I'm not the one who downvoted you.


I suspect you believe markets are close to efficient. If you don't, you are either investing large fractions of your wealth in a strategy that you believe will beat the market, or you are irrationally throwing your money away. Which is it?

The NP completeness of efficient markets has been known for quite a while.

http://dpennock.com/papers/pennock-ijcai-workshop-2001-np-ma...

It wasn't so long ago that some jerks at Princeton wrote a paper along the same lines, completely ignored all the existing literature to make their paper appear more novel, and got a lot of publicity for themselves (hint: prediction markets are unsexy, CDO's are sexy).


NP completeness is a strawman here. It's perfectly plausible to have an efficient market where the problem of accurate pricing is NP complete.

Furthermore, the paper you reference (while an excellent and fascinating paper) does not directly bear on the NP completeness of the stock market:

> In Section 3, I discuss the prospect of opening securities markets that pay off contingent on the discovery of solutions to particular instances of an NP-complete problems. Such NP markets would provide direct monetary incentives for developers to test and improve their algorithms, and allow funding agents to target rewards to the designers of the best algorithms for the most interesting problems. In Sections 4 and 5, I discuss markets in #P-complete problems, where prices serve as collective approximate bounds on the number of solutions, and bid-ask spreads may indicate problem difficulty

is his summary of what the paper does (sections 1 and 2 are introductory material). I claim that this does not at all show the NP completeness of markets, and further that it's a claim irrelevant to the discussion here.

In what sense are you claiming that he proves the "NP completeness of markets"? What does that mean? Why is it relevant to whether or not to invest money in the stock market?

(sidenote: I don't think the question of the NP-completeness of some questions related to stock pricing is irrelevant or uninteresting; indeed I just applied to grad school to study problems like these. I just don't think they bear on what you're implying they do)

That said, I voted you up because of your first sentence.


Neither the paper I linked to nor the paper written by the Princeton guys shows that equities markets are NP complete. They both show that markets in certain derivatives are NP complete.

However, I made a mistake and linked to the wrong paper. Here is the correct one:

http://dpennock.com/papers/fortnow-dss-2004-compound-markets...

Basically, the result says that if you have a market in derivatives which pays off when certain formulas in propositional logic are true (e.g., a derivative which pays off if A && (!B || C) is true, for specific events A,B,C), then the auctioneer's matching problem is NP complete. The auctioneer's matching problem is simply market making, and if the market were efficient, this problem would already be solved (by looking at prices).
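To make the flavor of that concrete: the payoff of such a compound security is just a Boolean formula over event outcomes, and deciding whether a book of these orders can be matched risklessly means reasoning over every truth assignment, which is where the SAT-like hardness comes in. A toy sketch (the events and formula are made up):

    from itertools import product

    # Payoff of a compound security: 1 if A && (!B || C) holds in the realized outcome.
    def payoff(A, B, C):
        return int(A and (not B or C))

    # The auctioneer has to reason about every possible outcome; brute force over
    # assignments is exponential in the number of base events.
    for A, B, C in product([False, True], repeat=3):
        print(A, B, C, "->", payoff(A, B, C))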

I don't think that loewenskind's claim is true, for the most part, I was just providing a more detailed source on NP completeness of some markets.


Hypothetically, couldn't I believe that some large inefficiency exists but that I don't know what it is?


Highly efficient != totally efficient. If you think you have some place where you know the market is inefficient, and you are right, you can go make money. (Or rather, your counterpart in the financial world will.) Now the market is that much more efficient again.

I think people citing the fact that the market isn't totally efficient aren't always proving what they think they are proving. The practical difference, for the vast bulk of us, between a market that is totally efficient and one that is only mostly efficient but whose inefficiencies are very, very hard to find, is pretty much zero. I often see people try to leverage this into painting the idea that markets are efficient as somehow "touchingly naive", to score points in various political fights, but there's a reason you have to reach for that emotional trick: the facts don't really support the idea of a grossly inefficient market in practice. (Instead, the problem is that "efficient" doesn't mean what you think it does; it certainly doesn't mean "good" or "moral" or "stable" or anything like that.)


Just because information is reflected immediately doesn't mean that it's reflected correctly.


Is Twitter predicting the market or are traders moving it based on perception of prediction?


Out of curiosity, has anyone used, or know someone who uses http://stocktwits.com/ to inform their trading decisions - and had a positive outcome? I haven't, I'm just curious.


Haven't used it, but I'm not sure why most people would when there are already tools like this built into most online brokerages. Most of the trading advice I have seen lives in one of two places: mass-market outlets such as this, or obscure and sometimes secretive forums.


If you can decode a way to 'beat the market', that means once you start beating the market someone else is losing. They will adjust their trading techniques, and your "algorithms" are now wrong.


I think you misunderstood the efficient market hypothesis.


I think you misunderstand 'hypothesis'


I doubt the HN users who think this is working will say so. ;)


Can someone please link to a download of the OpinionFinder module?


This is pretty fascinating.





