Machine Learning Can't Handle Long-Term Time-Series Data

mjburgess · on Jan 5, 2020

Time is only a symptom of what's missing: causation.

ML operates with associative models of billions of parameters: trying to learn thermodynamics by parameterizing for every molecule in a billion images of them.

Animals operate with causal models of a very small number of parameters: these models richly describe how an intervention on one variable causes another to change. These models cannot be inferred from association (hence the last 500 years of science).

They require direct causal intervention in the environment to see how it changes (ie., real learning). And a rich background of historical learning to interpret new observation. You need to have lived a human life to guess what a pedestrian is going to do.

If you overcome the relevant computational infinities to learn "strategy" you will still only do so in the narrow horizon of a highly regulated game where causation has been eliminated by construction (ie., the space of all possible moves over the total horizon of the game can be known in an instant).

The state of all possible (past, current, future) configurations of a physical system cannot be computed -- it's an infinity computational statistics will never bridge.

The solution to self-driving cars will be to try and gamify the roads: robotize people so that machines can understand them. This is already happening on the internet: our behaviour made more machine-like so it can be predicted. I'm sceptical real-world behaviour can be so-constrained.

nextos · on Jan 5, 2020

Exactly, that's why I think we need to put logic and probability theory back into cutting edge ML. [1,2] are only early approaches that show potential directions to achieve this.

Deep learning is very useful, but only one piece of the whole AGI puzzle.

Furthermore, many AI problems will benefit the generality of being formulated as a probabilistic program synthesis problem [3]. In this framework, lots of program semantics (~formal methods) concepts like abstract interpretation [4,5] might become very useful. They allow to explore huge program spaces very quickly.

Lastly, Pearl's do calculus [6] is a good starting point closely related to [1].

[1] http://probmods.org/

[2] http://pyro.ai/

[3] https://web.mit.edu/cocosci/Papers/Science-2015-Lake-1332-8....

[4] http://www.concrete-semantics.org/concrete-semantics.pdf

[5] http://adam.chlipala.net/frap/frap_book.pdf

[6] http://bayes.cs.ucla.edu/BOOK-2K/causality2-epilogue.pdf

sjg007 · on Jan 6, 2020

Well you need a model of the world you can simulate. That would be where you start. We could probably program a world simulation to help the robot learn in the environment. There would be some uncertainty though between the simulation and real world as well as uncertainty in the simulated vs learned object concepts. Interesting questions is if deep learning could learn the concept of gravity?

iandanforth · on Jan 5, 2020

The causal argument suffers from a problem of nomenclature.

On one side we have the colloquial understanding of cause and effect where a cause is a true impetus of effect. On the other side we have "causal" learning in biology where you're not actually learning causes, just strong correlations. We can learn just about any temporal association even if there is no direct cause-effect relationship. Random reward structures are a way to illustrate this: present a reinforcing stimulus to an animal at random times and a random subset of behavior will increase in frequency. The animal develops a false "causal" belief that a series of its actions is influencing the presentation of a reward.

That's why I like focusing on "sequence prediction", even colloquially we know predictions can be wrong. Those predictions can be influenced by low-d world models, but you don't accidentally elevate that model to claim a pure/symbolic/accurate model as can happen with incautious use of the words like "causal."

mjburgess · on Jan 5, 2020

The key element here is intervention. Animals learn by changing their environment.

Superstition in pigeons arises because they believe their actions cause the reward, it isnt "mere sequence". Any distribution over two variables observed overtime, for all time can change unpredictably given an environmental change.

Animals have rich models of objects and their behaviour over time, these models aren't "sequential", and they are brought to bare on deciding whether mere sequences should be regarded causally.

iandanforth · on Jan 7, 2020

'Animals have rich models of objects and their behaviour over time, these models aren't "sequential"'

This strikes me as patently false, which means I'm probably not understanding what you mean. What does forward simulation mean if it isn't sequential?

perl4ever · on Jan 5, 2020

You sound like an expert, but I have a really strong negative reaction to saying the issue is just nomenclature.

People give lip service to "correlation is not causation" by say, well, we're not going to call our correlations causation, we're just going to use them like that. There, are your delicate sensibilities satisfied?

No! Because if you have the wrong causal relationships whatever you call them used to predict stuff, the predictions will be wrong! Totally, not 1% off.

naasking · on Jan 5, 2020

> Animals operate with causal models of a very small number of parameters: these models richly describe how an intervention on one variable causes another to change. These models cannot be inferred from association (hence the last 500 years of science).

* Tegmark and Wu's AI Physicist: https://news.ycombinator.com/item?id=18381827

* AIXI: https://en.wikipedia.org/wiki/AIXI

scottlegrand2 · on Jan 5, 2020

Or you could engineer the roads and the cars to provide telemetry on where they are in high resolution in real time.

This would reduce a nearly impossible job in machine intelligence to a really difficult simulation problem.

But I don't see that happening anytime soon.

blackeneth · on Jan 5, 2020

We build cell towers everywhere. It’s not a stretch that we could build telemetry towers everywhere. The main problem is an economic chicken-and-egg problem: need to invest in towers before cars will be built. The one advantage is that one can be assured of success in developing a self-driving road system.

Now if we could reuse cell tower signals, the telemetry system is already in place. Cell phone operators keep the location of their cell towers secret, but with some persuasion ($$$) would reveal their coordinates.

fbender · on Jan 6, 2020

I see a huge risk for manipulation and spoofing. You‘ll have to establish some kind of trust if you want to rely on beacon data (from whatever source) for navigation and safety. Just imagine that someone spoofs a signal that triggers autonomous cars to emergency break – depending on road conditions and/or if there are non-autonomous cars as well, an attacker can create some serious damage, injury, and even death.

Given the existence of spoofed base stations (stingrays) among other reasons, a PKI-based solution may not be sufficiently safe. So you‘d have to overlay beacon data with sensors, at which point it’s questionable if there‘s a significant added benefit.

nabla9 · on Jan 5, 2020

This is clever crackpottery type brainstorming from a smart person.

The author has extremely grand set of connections he developed. It ties down Buddha, enlightenment, vipassana meditation, artificial intelligence, cybernetics, fractals and neuroscience. Nothing wrong with that, of course.

Creative thinker should have these kind of crazy ideas and connections every day or at least once a week. I carry with me a notebook that is full of them.

Most ideas die as 'premature babies'. They may be interesting to think and write down, but they are not fully developed and never fit together as well as you initially thought. Filtering and piking some of them to work with is important. Giving them up is the difference between crackpot and non-crackpot.

Forcing grand connections prematurely makes this crackpottery type. Sharing the creative brainstorm in an essay that does not try make up connections would be easier to read.

scottlocklin · on Jan 5, 2020

>This is clever crackpottery type brainstorming from a smart person.

I think you're being too kind here. It's just crackpottery as far as I can tell. I agree with you though that it is the type of dumb idea that should die in a private notebook.

scottlegrand2 · on Jan 5, 2020

Like so many of these rants, he doesn't actually demonstrate his ideas fix the problem, he just insists that they do so without proof.

Less thought-leadering, more actual results please.

anst · on Jan 5, 2020

Yeah, stir some mysterious ideas, maybe someone sees there a meaning. Quite surprised with this kind of vague magical thinking coming from lesswrong (thought they were rationalists or something).

IC4RUS · on Jan 5, 2020

I was surprised too, but it appears that the post has negatives votes, so I take it that it wasn't quite rational enough for the rest of the community.

mcguire · on Jan 5, 2020

Are you reading the same article?

tgflynn · on Jan 5, 2020

The line between crackpottery and genius is a fine one. If blogs had existed in 1900 and a certain patent clerk had written a post on his ideas about clock synchronization somehow being related to electromagnetism, many would have dismissed him as a crackpot as well.

The questions this article relates to are among the most profound and difficult that human reason has ever attempted to confront. I think one should be careful in labeling such ambitious speculation as crackpottery just because it doesn't yet amount to a fully coherent and formally testable theory.

throwawaymath · on Jan 5, 2020

Okay...but Einstein didn't write a blog. He published a paper for peer review. Some of his ideas remained controversial for decades, but he had a sufficiently mature, cogent and well-specified theory that he could at least work through hypotheses and publish results.

tgflynn · on Jan 5, 2020

He published that paper in 1905. Is it absurd to think that if the Internet existed in his time he might have blogged about his preliminary ideas before publishing a formal paper ?

throwawaymath · on Jan 5, 2020

Absurd is a really strong word. I will say that I really doubt he'd blog about his ideas instead of just publishing them, even on arXiv, because all the examples of groundbreaking new work in the modern era have been blogged about contemporaneous with, or after peer review of formal papers.

I'll also go further and say that, while there's a kernel of validity to your analogy, it's not the right analogy with which to deliver your overarching point. I don't think the publishing method for one of the most significant scientific advancements of the previous century is a particularly good lens for analyzing this blog post.

The critical content of this post is far below the threshold usually associated with an idea sufficiently well formed to be publishable. Einstein had a minimum viable theory before he solicited feedback; and when he did solicit that feedback, it was through what we'd consider orthodox channels.

tgflynn · on Jan 5, 2020

What you say is true.

The main point I was trying to make was that given a speculative post of such breadth, which touches on such difficult issues as AGI, how the brain works and perhaps even the nature of conscious experience, and which makes some claims that are at least interesting, I think it's quite presumptuous to assert that these ideas are all nonsense without a deeper exploration of them. I certainly would not want to make such an assertion, despite being troubled by what I think are some inaccuracies in the author's description of certain physical concepts.

Now a secondary issue is that it is true that as far as I'm aware major scientific discoveries have typically been initially published in much more developed form and have thus been the work of a single individual or of a relatively small group of closely affiliated individuals. I'm not convinced however that this historical model of very small scale scientific collaboration is necessarily the only one nor the best one in light of modern means of communication.

It seems at least conceivable to me that there is a possible future in which the following hold:

* There is some kernel of validity in this author's ideas.

* A small number of other people find them intriguing and choose to collaborate with the author to further elaborate them.

* This collaboration leads to major progress in our understanding of one or more of the areas mentioned above.

For me the, admittedly very small, likelihood of such an outcome, justifies the author's post and its appearance on HN.

gumby · on Jan 5, 2020

> If blogs had existed in 1900 and a certain patent clerk had written a post on his ideas about clock synchronization somehow being related to electromagnetism, many would have dismissed him as a crackpot as well.

That "certain patent clerk" was hardly working in a vacuum; not only was he building upon the work of others (e.g. Lorentz just to pick someone) but he had been in school with some of them and was in constant communication with them, as he was hardly the only one working on the problem.

This is not to minimize his brilliance (Special Relativity in particular has that wonderful property of being completely obvious once explained...but that you could almost see before the explanation but yet nobody had previously characterized) and the mind blowing nature of his three-paper year. But when published his work fell on fertile ground.

tgflynn · on Jan 5, 2020

> But when published his work fell on fertile ground.

I'm not sure how fertile that ground was. It was still 3 years after the year he basically laid the foundations of modern physics before he was able to get an academic job.

rad88 · on Jan 5, 2020

This is a general purpose excuse for any rambling theory posted online.

shadowgovt · on Jan 5, 2020

It is a fine line, but it exists. Results make the difference.

scythe · on Jan 5, 2020

To be clear, you’re not disagreeing that the author identifies a problem—you disagree (as I do) that “organized fractally” is not a meaningful phrase that can guide the development of new AI, correct?

mlthoughts2018 · on Jan 5, 2020

I do disagree with the author’s claim. For example, Bayesian ARIMA is absolutely a kind of machine learning model, and can do very well in some long term time series applications.

Posts like these usually want to create a strawman narrow definition of what is allowed to count as “machine learning” and the work backwards to say that subset of models can’t handle some type of problem.

Machine learning is just statistical modeling. To the extent that any kind of statistical modeling adequately solves long term time series inference goals, then so does machine learning.

NOT_A_ROBOT · on Jan 5, 2020

For the past 10 years, I have been called a "crackpot" by friends and family, but It never crushed my motivation.

I moved forward with 5 hours days / 5 days a week while keeping a full time job.

Today I am sitting on a serious "wealth machine".

We have to accept that 98% of human cannot "think outside the box"

skunkworker · on Jan 5, 2020

This article seems out-of-date by 5 years or more even though it was published today, and I am unsure as to why.

It calls out long short term memory but doesn't mention recent (last 5 years) improvements like Gated Recurrent Networks (GRUs) or Transformers (GPT-2, huggingface/transformers) which have shown significant improvements over the traditional LSTM model. These can handle time series data much better than older models could.

joe_the_user · on Jan 5, 2020

I think you need to give details and references to support a claim that these innovations make a fundamental difference.

I don't doubt that the things you mention involve improvements but are these improvements doing better on the same benchmarks in the same fashion or a fundamental change. I read many claims that recent changes in deep learning represent the former.

skunkworker · on Jan 5, 2020

It's to the point now where using a basic LSTM is almost discouraged compared to using a Transformer. GPT-2 wouldn't have been possible without these recent innovations.

Here are some good guides on Transformers [1] and attention/ multi headed attention [2], as well as the paper that proposed the transformer model "Attention is all you need" [3]. GPT-2 heavily relies upon the advancements that transformers brought [4]

[1] http://jalammar.github.io/illustrated-transformer/

[2] https://towardsdatascience.com/attention-for-time-series-cla...

[3] Attention is all you need. https://arxiv.org/abs/1706.03762

[4] http://jalammar.github.io/illustrated-gpt2/

joe_the_user · on Jan 5, 2020

It's to the point now where using a basic LSTM is almost discouraged compared to using a Transformer. GPT-2 wouldn't have been possible without these recent innovations.

So, but that "point" could just be that being a published academic means always using the latest thing. That doesn't demonstrate how much better these are, much less demonstrate that they represent fundamental steps forward.

siekmanj · on Jan 6, 2020

This is mostly true for supervised and unsupervised learning models, but for reinforcement learning the LSTM is king because of the convenient fact that it can be evaluated one time step at a time, instead of just outputting a sequence like a transformer. For things like robotic control, etc, attention-based models are pretty nonsensical.

jeremysalwen · on Jan 6, 2020

Not true, a transformer can be used in models without any lookahead, for example how it is used in gpt-2.! The real difference is the complexity of the model and the large increase in computational cost.

curiousgal · on Jan 5, 2020

Except GPT-2 is irrelevant when it comes to time series specifically.

Erlich_Bachman · on Jan 5, 2020

GPT-2 handles natural language which is specifically a time-series set, a sequence of natural word tokens. It is especially relevant because success in NLP tasks requires tracking and learning long-term relationships - not just between words in a phrase or two, but between words and phrases in different paragraphs, in different parts of the text, so that a general coherence of the whole document is kept. This exactly what is done with long-term relationships.

curiousgal · on Jan 5, 2020

I know that NLP sequences are analogous to time series but GPT-2 can't be applied to say financial time series for example. That's what I meant.

huffmsa · on Jan 5, 2020

Or predicting the performance of a sports team over time.

From what I've seen, It breaks down at the embedding layer, because while the teams "remain the same" in name / dictionary, their actual relative relationship to each other varies season to season / week to week.

NegatioN · on Jan 5, 2020

My response here is purely intuition, since I have never worked much with time series.

But wouldn't capturing that relationship require periodic retraining or other components to the network regardless? It may suggest that end-to-end training of a transformer is not suitable for these tasks, but that it might still capture the prediction of the long-scale time-series, if provided with extra data at each timestep in addition to the embeddings?

huffmsa · on Jan 5, 2020

It does generate a long-term representation, the issue being that even with context and timestep specific data, that embedding is too general to make a good representation.

Sports are particularly problematic because almost all teams and statistics regress to the average at some point, meaning your generated future timestep context clues don't really help modify the embedding.

You're also dealing with variation within a season (injuries, better play, etc) and between seasons (personnel changes, rule changes, new stadiums, etc). So a team might have 4 seasons of above average performance, and then abruptly be the worst team in the league the next because they lost their coaches and star players.

fhars · on Jan 5, 2020

Then where is the publication that uses GPT-2 for a long-term NLP task, like, for example, reconstructing the rules of the great vowel shift from a corpus of all pre 20th century English?

Synchronously parsing the meaning a single text has at one moment in time involves no time series at all.

yorwba · on Jan 5, 2020

How about a model that can detect the semantic shift of "Amazon" from "river" to "company"? https://arxiv.org/abs/1703.00607

They're not even using transformers, just simple word embeddings.

huffmsa · on Jan 5, 2020

The parent comment isn't thinking time-series like you and I are.

Being able to follow multiple agents and correctly deduce their relationships at a given time t is very hard.

NLP "time-series" does a fine job at making back references within a text, but wouldn't be able to have multiple representations of a word or character through the years.

It's very hard to get the computer to say "ah, the context is 16th century, so here are the relationships" without fudging it / tailoring models via tailored corpuses.

yorwba · on Jan 5, 2020

Would adding a context vector for "16th century" be "fudging it"?

huffmsa · on Jan 5, 2020

Depends. In the articles Uber example, the failure wasn't with detection, it as with context switching.

The detection kept changing, and so the model kept going "oh, new object, restart decision process."

Lacking the ability to generate and maintain it's own context is an area where a human would do better. We might not know what the object was, but our "slow down" response wouldn't keep resetting depending on what we classified the object as.

Same as words switching meanings within a piece or sentence. It's hard, but most humans can pickup when the usage changes

laingc · on Jan 5, 2020

I have no issue with you asking for a reference for this, but when I read your comment I did a double-take. To someone familiar with deep learning, this claim is completely self-evident. The article is indeed almost farcically out of date, and LSTMs haven’t been close to the state of the art for years now.

lern_too_spel · on Jan 5, 2020

Even weirder, it calls LSTM a "newly invented variant" of RNN. LSTM is 20 years old.

hnews_account_1 · on Jan 5, 2020

Any links to heavy time series based machine learning algorithms? I'm in finance, and while I know how to establish and run a random forest or gradient boost regressor using standard libraries, I've never had a good handle on them.

cbsmith · on Jan 5, 2020

Most everything Eamonn Keogh publishes: https://www.cs.ucr.edu/%7Eeamonn/selected_publications.htm

throwawaymath · on Jan 5, 2020

Look into matrix profiles and associated algorithms.

joe_the_user · on Jan 5, 2020

This claim seems plausible.

The reason seems even simpler than the article. Deep learning requires lots of training data - that data naturally needs to more or less be "the same"; follow "the same" logic.

A long enough time series is going to involve a change in the logic of the real world, a change that the network won't be trained for.

s_Hogg · on Jan 5, 2020

Deep Learning doesn't necessarily require a huge amount of data. What DL does is allow you to fit more complex relationships between input and output than would be the case with, say, a bog standard linear model. If a relationship is complex and also clearly defined in your data, then you don't necessarily need much. In general it's true that may not be the case, but that doesn't make "deep learning requires lots of training data" true, only that "the data used for deep learning models is typically noisy on top of representing a complex relationship".

It's a semantic difference, but a very important one if we're to avoid going down the road of just mindlessly throwing compute at every problem. And if we do that, we'll just wind up with millions of Rube Goldberg machines instead of actually solving problems.

The change in logic of the real world thing is absolutely spot on, though. Over enough time it becomes basically impossible to disentangle effects.

joe_the_user · on Jan 5, 2020

Deep Learning doesn't necessarily require a huge amount of data.

References? I mean, I know "one shot" and similar approaches but as far as I know, these involve extending a neural network that has been already trained, on massive data, to a little bit more.

dclowd9901 · on Jan 5, 2020

I think this is where the lack of “imagination” that computers currently cannot replicate becomes a huge problem that will set AI and ML back decades more.

Erlich_Bachman · on Jan 5, 2020

What are you even referring to? In which of the tasks that ML is currently applied to do they lack "imagination"? GPT-2 generators have plenty of imagination in generating new phrases and meanings... Alphastar has plenty of imagination of making new moves in SCII that even human players haven't come up with yet... Etc..

s_Hogg · on Jan 5, 2020

They're designed to appear that way, it doesn't mean they actually are imagining anything. We need to be very careful about accidental anthropomorphism when it comes to models 0 seeing an analogy between the output in front of you and the human mind can blind us to the reality of what's going on. Which is often more prosaic, even if complicated.

Erlich_Bachman · on Jan 6, 2020

That sounds like you are operating on your own definition which you can apply and shift however you want. This is not what objective science is about. How do you know real humans are not displaying "accidental anthropomorphism"?

scottlocklin · on Jan 5, 2020

Machine learning does just fine to extremely well at long term time series data; there are entire branches of machine learning dedicated to this. The fact that this imbecile never heard of these tools is why nobody should be reading his essay.

Uber's engineers didn't do this for their human finder because;

1) Image recognition stuff isn't explicitly built to do this (though it easily could be jury rigged to do so)

2) Uber's engineers apparently never heard of the concept of "moving averages" and "threshholds" which would have worked just fine.

"More precisely, today's machine learning (ML) systems cannot infer a fractal structure from time series data."

-look at this idiot using words he doesn't understand. Muh fractals.

justapassenger · on Jan 5, 2020

“Imbecile”, “idiot”? Please refrain from personal insults as they add nothing to the discussion.

scottlocklin · on Jan 5, 2020

"Imbecile" and "idiot" were measured and reasonable adjectives for the gibbering nonsense published above. The drooling lackwit who wrote this should be tarred and feathered for such frippery and nonsense.

As I said above; machine learning does just fine to extremely well at long term time series data; there are entire branches of machine learning dedicated to this.

cbsmith · on Jan 5, 2020

I'm feeling like this is entire missing the whole world of Matrix Profiles and Time Series Chains...

NPMaxwell · on Jan 5, 2020

The article I would like to read is what the challenges are to including a few prior states in navigation. I'm amazed that, when I drive over or under a bridge, my online mapping software changes instructions as if my car were able to levitate 20 feet onto the roadway above or below, even when that roadway is a highway without exit or entrance within a mile.

longemen3000 · on Jan 5, 2020

I remembered that neural differential equations are better suited to represent time series data, I saw them being used a lot in pharmacological processes, any additional idea or insight related to this?