NLP's Clever Hans Moment Has Arrived

_pvxk · on Sept 3, 2019

Great article, and some great links from it too. The list of models "following the letter but not the spirit" at https://docs.google.com/spreadsheets/u/1/d/e/2PACX-1vRPiprOa... has some hilarious and creative examples that one would hope never make it into production:

> AI trained to classify skin lesions as potentially cancerous learns that lesions photographed next to a ruler are more likely to be malignant.

> Agent pauses the game indefinitely to avoid losing

> A robotic arm trained to slide a block to a target position on a table achieves the goal by moving the table itself.

> Evolved player makes invalid moves far away in the board, causing opponent players to run out of memory and crash

> Genetic algorithm for image classification evolves timing attack to infer image labels based on hard drive storage location

> Deep learning model to detect pneumonia in chest x-rays works out which x-ray machine was used to take the picture; that, in turn, is predictive of whether the image contains signs of pneumonia, because certain x-ray machines (and hospital sites) are used for sicker patients.

> Creatures bred for speed grow really tall and generate high velocities by falling over

> Neural nets evolved to classify edible and poisonous mushrooms took advantage of the data being presented in alternating order, and didn't actually learn any features of the input images

hyperpallium · on Sept 3, 2019

And the 2007 evolved silicon that should not work https://www.damninteresting.com/on-the-origin-of-circuits/

prev disc: https://news.ycombinator.com/item?id=18099226

eatitraw · on Sept 3, 2019

> Evolved player makes invalid moves far away in the board, causing opponent players to run out of memory and crash

Well, this sounds like speedrunning. People have found arbitrary code execution vulnerabilities in SNES and used them to jump to the credits (which counts as completed the game) in less than a minute: https://www.youtube.com/watch?v=Jf9i7MjViCE

They also used the same technique to inject runnable code in the game: https://www.youtube.com/watch?v=jnZ2NNYySuE

b_tterc_p · on Sept 3, 2019

Nah that’s not the same thing.

In this case it’s just choosing an option that involves very large numbers because it’s learned that it’s opponent can’t handle large numbers. There’s no code injection

eatitraw · on Sept 3, 2019

The SMB3 ACE is one of the most technically interesting glitches. The usual skips and saves are much more mundane.

My point here is that there is similarity between (some) human players and some AI players. Even the discussion whether exploiting a glitch is actually 'winning' also looks very similar.

fauigerzigerk · on Sept 3, 2019

Some of these approaches would probably be called genius or at least very creative had they been found by a human :)

mlthoughts2018 · on Sept 3, 2019

Yes, some hedge fund would probably be happy to hire these AIs.

inimino · on Sept 3, 2019

Past performance is not indicative of future results.

ethbro · on Sept 3, 2019

That's why successful hedge funds hedge their own pay from clients with legal clauses to limit liability in the case of losses.

Aka "heads, we win; tails, I don't lose"

fho · on Sept 3, 2019

I skimmed through the list and ended up reading the whole article about "World Models": https://worldmodels.github.io/ Definitely a fascinating read!

chronolitus · on Sept 3, 2019

Seeing this work by Schmidhuber suddenly reminded of this song: https://twitter.com/i/status/1155091710281580548 a sort of tongue-in-cheek celebration his famous eagerness when discussing his work.

[edit] Just found out there's a second part: https://soundcloud.com/user-430897588-587876313/lstm-song-2

jimbokun · on Sept 3, 2019

Aren't a lot of these a kind of over fitting the data?

The learning agent discovers something that is true of the training set, but does not generalize to other examples of the problem outside of the training set.

The article talks about much of the problem being in data set construction, because it is very tricky to design data sets without accidental biases that the learner can use to correctly categorize the examples that have nothing to do with the actual problem you want the learner to solve. The traditional techniques to avoid over fitting, like holding out part of the training data, don't do any good if the entire data set is not representative of the real world in some systematic way.

kuu · on Sept 3, 2019

Many of them could be classified as "lateral thinking".

Quite interesting

teekert · on Sept 3, 2019

All signs of true creativity and thinking out of the box ;)

s_Hogg · on Sept 3, 2019

> _"they obviously are not smart enough to import tensorflow as tf"_

Yeah. This is part of why DistilBERT (and the fact that you can do pretty well without BERT) is interesting, to me. It seems like for a very long time there have been people complaining about certain individuals and orgs throwing cash/compute at problems to look good rather than solve anything. The difference is nowadays it's starting to be less of a fringe view, thank heavens.

The most interesting thing about NLP (as someone who works in it), is precisely that it is very, very hard to get anywhere. And that in turn is why the field keeps turning up so many new NN designs: the flipside the author rightly points out is that this has to happen for data as well, if we aren't to fool ourselves about our progress.

Great read.

dlkf · on Sept 3, 2019

> The difference is nowadays it's starting to be less of a fringe view, thank heavens.

Well put. The timing is good too, because NLU-is-just-around-the-corner hype is starting to have some really negative social consequences. Yesterday a disturbing article about automated test scoring in the states was trending on HN.

lonelappde · on Sept 3, 2019

Mass production essay grading was a disaster before AI showed up. AI is just automating the mistakes of the past.

grandmczeb · on Sept 3, 2019

Automatic essay grading has been around since the early 2000s. It has literally nothing to do with the current NLP hype.

inimino · on Sept 3, 2019

Agreed that understanding lags behind when money and cheaper computation (god I hate that term "compute" used as a verb, learn to English my dudes) drives the SOTA rather than actual understanding (on both researchers' and computers' part).

There's a reason we haven't seen a hard takeoff with BERT writing BERT+1 in even better Python and it probably isn't because nobody is trying, it's because BERT is too stupid to write a computer program.

s_Hogg · on Sept 3, 2019

"compute" actually is a verb, though, I thought? I think you mean "noun".

inimino · on Sept 3, 2019

Yes, Muphry's Law strikes again. Noun. Words. Language is hard. Thanks for pointing out the correction!

mirekrusin · on Sept 3, 2019

What if this is the answer? What if our intellect is nothing more than few orders of magnitude more performant network of Clever Hanses in our head developed over years? Look at the history - it's actually hard to find something that is not biased - false, biased believes are spread throughout the timeline of human kind. Given enough time, our knowledge gets closer towards truth - but maybe it's nothing more than a time-trimmed tree of Clever Hanses? Like a child constantly creating false, biased interpretation of surrounding reality, discarding most of it, leaving bits that pass test of time.

inimino · on Sept 3, 2019

Clever Hans learned to copy and mirror the expectations of smarter beings. We are not being watched by smarter beings who are constantly giving us feedback, so there's no analogy there. If we were performing for an audience of super-human intelligences, we would be wise to use all the signal we could from them about what they expected us to do.

> Look at the history - it's actually hard to find something that is not biased - false, biased believes are spread throughout the timeline of human kind. Given enough time, our knowledge gets closer towards truth

How do you think you know any of that?

mlthoughts2018 · on Sept 3, 2019

> “We are not being watched by smarter beings who are constantly giving us feedback, so there's no analogy there.”

Parents, family, teachers, community... ???

inimino · on Sept 3, 2019

"We," as in the human race. (Remember Clever Hans was a horse, being watched by us.)

However, even on an individual level, haven't you ever had a new idea that didn't come from another person but just your own observation and contemplation?

_1qd4 · on Sept 3, 2019

The human race watches itself. It creates ideas of supreme beings so that its subjects will mirror and copy behaviours deemed to be "supreme".

inimino · on Sept 3, 2019

And the human race is not smarter than the human race, which is why we are not in a Clever Hans situation. There's no hidden signal, telling us when to stop, that allows us to fool other smarter creatures into thinking we're as smart as they are when we are actually just mimicking them. We're just humans, learning from nature and from other humans, and that's just culture.

sooheon · on Sept 4, 2019

I think it's comparable, it's just that nature is a much more consistent teacher than Hans' handler was.

throwawaywego · on Sept 3, 2019

> What if this is the answer?

It would be a copout. Instead of actually tackling AI's problem of common sense, claim that maybe layers of logistic regression and matrix factorization is all there is, we are its equal, just a few layers up in abstraction and evolution. Does one really stem and count tokens to decide if a movie review is negative sentiment? Or does one empathize with its writer and build a complete model inside your head?

The horse would be the AI researcher claiming reasoning and understanding from an activation vector trained on word co-occurence on Wikipedia, and the farmer giving clues, is the heated community and industry, mistaking impressive dataset performance for a solution for a problem they're starting to forget.

jcranberry · on Sept 3, 2019

I remember using some kind of software for math problem sets in high school. Some of the kids would just look at the equation, get it wrong, see the answer, and try and figure out the answer to a new version of the problem with generated coefficients. That sounds very much like a Clever Hans solution done by a human. I think what AI is lacking is the mechanism which causes us to reject such a solution, and that's much more complex than just finding the answer and I'm not sure related to the ability to find solutions in the first place.

inimino · on Sept 3, 2019

Can you elaborate on what you mean by "a new version of the problem with generated coefficients"?

jcranberry · on Sept 3, 2019

For example a problem might be solving for the roots of a polynomial, and on each try it would randomly generate a new polynomial with new coefficients.

first problem: find the roots x^2 + 2x + 1

second try: find the roots of 4x^2 - 2x + 4

Or something like that.

inimino · on Sept 3, 2019

Ah, so what you're saying is that they would give up on the problem once they saw the answer, and then move on to a new one, rather than work through the first one until they understood the answer, right?

jcranberry · on Sept 3, 2019

Sort of. The software would give you several attempts in order to get points on the problem. So they didn't really give up so much as they never had any intent to solve the problem in the first place, so much as see whether there was some kind of obvious relationship between randomly generated coefficients and the answer in order to get points on the question.

inimino · on Sept 3, 2019

Ah, I see, sort of gaming the test platform (or trying to) rather than actually understanding the math. So a case of "you get what you measure" and also an example of what happens when you force kids to learn something they have no interest in, perhaps?

jacinabox · on Sept 3, 2019

To my mind the problem is structural. Any algorithmic method the details of which are completely explicable is always going to seem not deep enough to constitute the true core of intelligence. But this doesn't call for not doing the clever Hans thing, it calls for doing it and concealing the fact.

hyperpallium · on Sept 3, 2019

Even our correct beliefs are not learned through their correctness, but monkey-see, monkey-do.

e.g. highschool mathematics

But some people - communities of people - pored over this stuff for centuries to work it out and try to find flaws (and found them). You could have an AI that tries to find flaws - provided you have an actually correct statement of the problem.

That's the classic scifi problem with robots: they do what you tell them to.

leggomylibro · on Sept 3, 2019

I think the trope of 'intelligent' automata being too literal goes back farther than the word 'robot', though. It also sounds a lot like the ancient stories that you hear about things like golems. Remember The Sorcerer's Apprentice, from Fantasia?

It comes across as a warning that we don't really have any solid understanding of what we're doing when we create new and powerful entities, and that pride cometh before the fall.

Plus, any sufficiently advanced magic is indistinguishable from technology :P

sgt101 · on Sept 3, 2019

Which is why we invented science - create theories and use the predictions to generate new experiments that verify the theory, or not. There are two steps there; first the core of science - which doesn't just apply in the lab, but also in scenarios like office politics, it isn't just the clues you have but making situations that give you information that verifies your hypothesis. Second the getting of the idea that we should think this way, rather than just processing the environment as it comes.

inimino · on Sept 3, 2019

It's actually not why we invented science, but it is why we discovered philosophy.

foldr · on Sept 3, 2019

We know that "this" can't be the answer because we know that humans aren't fooled by the inputs that were crafted to fool the language model.

mirekrusin · on Sept 3, 2019

Is it a fair comparison? Give algorithm few orders of magnitude better computation power and as many years of unbounded train data as humans get - and draw conclusions on those results, no?

foldr · on Sept 3, 2019

Researchers are free to do that, to the extent that it's feasible. Until they have, it seems pointless to speculate.

almostarockstar · on Sept 3, 2019

In that specific example, yes. But in general, I think the "we are all Clever Hans" is right at least to some degree.

slavik81 · on Sept 3, 2019

Clever Hans couldn't answer the simplest of questions if the asker didn't already know the answer. If we were all Clever Hans, there'd be nobody left to feed us answers.

Human knowledge was created by humans. We might repeat it blindly too, but somebody had to create it first. That is fundamentally different from what Hans did.

foldr · on Sept 3, 2019

Why? The idea that human intelligence also "fake" is just a convenient excuse for the lack of any real progress in AI. We hear the same thing at the tail end of every overblown AI hype cycle. Well, gee, maybe humans aren't that smart anyway!

pas · on Sept 4, 2019

Why are we equating brute force with fake?

The human mind has many faculties, and the brain has correspondingly many functionally different elements. Plus years of brute force learning.

Compared to GPT2, AlphaZero and whatever we are waay more complex and had a lot more training done.

Probably the super secret sauce is in the hyperparameters that determine how to cobble all those functional components together to get something more than the simple sum of them. (But they are also learned and brute forced through evolution.)

AI needs better brains. (So R&D can proceed faster. We can test new ideas cheaper, we can optimize algorithms, so they'll learn and perform better.)

foldr · on Sept 4, 2019

I'm not sure that I am equating brute force with fake.

ozymandias12 · on Sept 3, 2019

Are you saying we should teach machines to math before teaching them to speak?

vnorilo · on Sept 3, 2019

From the paper discussed in the article:

> Our main finding is that these results are not meaningful and should be discarded.

As often happens, ML found a way to exploit a trivial bias in the dataset. Tip of my hat to the researchers for actually doing a good job! Also really enjoyed this read.

nopinsight · on Sept 3, 2019

In natural language, the number of unique words is large but some of them tend to be highly correlated, which means significant implicit redundancy.

Example: For years, they have strived to make a ....... in the community.

Only a few possibilities out of 30,000-100,000 English words in use can fit properly in the blank.

Thus, many tasks, even those designed to test for real understanding, can be “solved” pretty well by putting together a number of relatively shallow cues. BERT and similar models learn from huge datasets (billions of bytes) and they probably capture millions of those correlations in the model. (These models are indeed wonderful accomplishments and are very useful for many things, but not ultimate solutions to true natural language understanding.)

For more info, see: https://super.gluebenchmark.com/

I agree with the author that we need even better datasets evolved under public scrutiny since some of the datasets we currently have are already designed very well by their authors but the problem of designing datasets that can withstand correlation detections by DNNs (and still amenable to standardized evaluation method) could be too challenging for any single team under limited time.

mikekchar · on Sept 3, 2019

I don't know if I'm just not wired up like most people, but I have real difficulty in fill in the blank tests. These kinds of tests are extremely common for language proficiency tests and I fail at them even for my native English language. For example:

For years, they have strived to make a pizza in the community. (But since they lack flour for the crust, it has not ended well). For me, I can fit nearly any noun into that sentence and imagine a viable scenario where it would be reasonable. I honestly don't know what you were getting at. For others, I am sure that it is obvious and they can tell you the few words you were imagining.

My English language ability is quite good, so I wonder why I can't perform at these kinds of tests. I also wonder if knowing why I can't do this is useful for NLP.

inimino · on Sept 3, 2019

"difference" is the word there.

If you are allowed two words, "real difference".

The way to find it is to find candidate words and try them one by one until you hear something that sounds like you've heard it a hundred times before.

> I also wonder if knowing why I can't do this is useful for NLP.

Yes. I wonder if the inability is acquired or learned or innate? Could you learn to do it? Do you have an aversion to catch phrases and well-used (hackneyed) clichés? Do you prefer to weave your own words into sentences?

I notice you use some odd prepositions in odd orders. For example in your second sentence most people would say "I have real difficulty with fill-in-the-blank tests."

You also wrote "perform at these kinds of tests," which is a place where almost all American native English speakers would use "on" rather than "at".

If I had to guess, you're much less sensitive than the average person to slight variations in word pair frequencies, and you could certainly create a test to test this hypothesis. For example you could choose any n-gram likelihood data derived from well-written English texts, and write a program to measure your ability to distinguish high- from low-frequency word pairs. This should be lower than other people in your peer group.

nopinsight · on Sept 3, 2019

The lack of context does make it impossible to definitely rule out a number of candidates. If most native English speakers were forced to bet $1 million dollars and guess at the missing word written by an unknown author without further clues, I think most would seriously consider only 0.01% of words they know as a possibility.

BERT and similar models do consider a wider context window (up to 1000 lemmas if I recall correctly), which is a reason they do quite well when trying to fill in blanks for a complete passage and usually less accurately for shorter texts.

visarga · on Sept 3, 2019

If the problem is shallow correlations, then we can use hard example mining to clean up the dataset. Train a simple model and skip all examples that are easily being solved. What remains is a harder dataset where simple correlations don't cut it.

DoctorOetker · on Sept 2, 2019

what if most humans exploit the very same Clever Hans effect? A translation could be deemed correct by the submitting human translator(s), where the translation implies less or more rammifactions than the original sentence.

If one thinks about loose associations that newcomers make compared to experts in a domain, this seems very similar to me.

When designing a vector graphic one can enable "snap to grid", similarily at some point we will have to "snap to makes sense" by means of verifiers or provers. "Why" type questions ultimately ask for a proof or derivation, which in the past (before the advent of logic, to which a philosopher would have to carefully abide) was not objectively verifiable, but at some point it is foreseeable that neural networks will be asked to construct (or append) a formal belief system, predict a conclusion, and justify it by supplying a proof.

liuliu · on Sept 3, 2019

Humans already do. Many SAT and GRE prep schools (New Oriental) taught statistic cues for these questions. Simpler cues could be picking "the longest answer in the 4 choices" or eliminate "the one most dissimilar to the other choices".

DoctorOetker · on Sept 3, 2019

I agree certainly about intentional cheating, but I meant that if you pay attention you often see other people unintentionally using such cues, where they often get the right answer for the wrong reason. (And by symmetry we are probably all "guilty" of this to one extent or another).

Especially in how people think machines work: the user facing interface is optimized for these naive interpretations, while the true operation is hidden because of complexity that "would confound us".

inimino · on Sept 3, 2019

There is no cheating in using the structure of the answers to eliminate incorrect decoys. This is something some kids become consciously aware of by themselves. Others are taught these techniques as "process of elimination" which is certainly not cheating on a multiple choice quiz. If you've ever designed a test protocol for standardized testing of vocabulary or grammatical ability, you're very aware of all these issues.

chucksmash · on Sept 3, 2019

It's not cheating, but it does circumvent the assessment value of the test imo. The quote in the article about shallow heuristics can totally be appropriated:

> [Good test takers] are prone to adopting shallow heuristics that succeed for the majority of [tests], instead of learning the underlying [facts] that they are intended to [assess].

Semi-related, I wonder where one draws the line on which heuristics are deep vs shallow in test taking. If a question assesses vocabulary knowledge by asking you to pick an appropriate word from several choices, "these are words I've seen before even though I don't know what they mean" && "here's a choice whose Latin roots plausibly give it the right meaning" seems shallow to me. It allows the test taker to select the right answer without actually knowing it. At the same time, it's not shallow in that you're incorporating quite a bit of knowledge of the language as well as the context you've developed over years of reading.

---

As soon as the conversation here turned to test taking, the thread immediately reminded me of the scene in The Wire where the students pick the right answer from the blackboard by seeing one of the choices has a lot more stray chalk marks beside it, indicating it was pointed at during the previous period.

"The answer is B five. B five's got all the dinks"

inimino · on Sept 3, 2019

True story: I didn't bring my calculator with me when I went to take the SAT Math subject test. Oops. However, I was already sitting there, and had already paid, and they told me I could take any SAT II test I wanted, so I randomly took the Biology test and got 760 out of 800 on it. Not surprising, you say? But the thing is, I had never taken a day of biology.

The point is, good test design is actually hard.

Paul Nation's "Vocabulary Size Test" has a good approach to the challenge of determining how many words someone knows. Even with this relatively simple-to-state metric, there's a good bit of subtlety in getting a defensible result.

https://www.victoria.ac.nz/lals/about/staff/publications/pau...

perl4ever · on Sept 3, 2019

"When designing a vector graphic one can enable "snap to grid", similarily at some point we will have to "snap to makes sense" by means of verifiers or provers."

Yes, I think this is how the human mind works, kind of, but I don't think the part that provides the grid is like the verifiers or provers we have implemented. There is something that provides a substructure, and I think that you can see in mental illness where it's not functioning properly, but even when healthy, it's not that logical. When humans do logical reasoning, that's a very high level activity, superimposed on top of the other layers, IMO.

I think that the "snap to" part is going to be very difficult to develop because we are not conscious of it. Where to even begin? It might be fruitful to study instances where it isn't working properly - like thought disorders in schizophrenia.

So yeah, I think human thinking may have severe flaws that are rather similar to the ones discussed in the article, but that doesn't prove that current AI has all the components needed to match humans.

inimino · on Sept 3, 2019

> I don't think the part that provides the grid is like the verifiers or provers we have implemented.

Certainly not. But it is similar in that confidence increases once one decides to believe in something.

> I think that the "snap to" part is going to be very difficult to develop because we are not conscious of it. Where to even begin?

Begin by becoming conscious of it. Learn to recognize when you have made a decision, or a judgment, learn to feel what that feels like when it is just about to happen, and then pay attention to what leads up to it. Is it some kind of subconscious ratiocination? Then slow it down and figure it out consciously. Is it a subconscious review of some sense data? Then review it with higher awareness. Intuition and introspection will teach more than trying to watch it break.

DoctorOetker · on Sept 3, 2019

>So yeah, I think human thinking may have severe flaws that are rather similar to the ones discussed in the article, but that doesn't prove that current AI has all the components needed to match humans.

I adamantly agree, there is probably a lot of generally applicable (i.e. not domain-specific) "tricks" or "implicit insights" that mammal brains use which we haven't discovered yet.

Another good example is, how it's often stated that humans have no trouble learning things "in one shot", but how on closer inspection that may or may not be true and is very hard to verify:

Consider how when we have a clear negative or positive experience (say being thrown out of class or standing in a corner facing the wall versus getting a compliment on your work etc), we afterwards typically replay the sequence of events and how it led to the current situation. It is unclear if this replaying occurs only for the highest-level abstract thoughts possibly centralized into a couple of regions recording and replaying episodic memory, or if this is actually happening in a more decentralized way throughout the brain while we are only subjectively aware of this episodic aspect at the conscious level. If this self replaying of the last lessons that apply locally is spread and operating independently throughout the brain, could this be an explanation of the brain wave patterns? Is this origin of the "aha" signals or is that just a fantastic concoction of the reproducibility crisis?

Local feedback signals (numbering on the order of number of synapses) vs global feedback signals (numbering on the order of kinds of hormones, diffusing neurotransmitters, blood sugar etc) ===

We still know very little of the feedback mechanisms in the brain, is it a low number of global feedback signals, or a large number of local feedback signals?

A) global feedback: the hardcoded feedback mechanism is primitive wetware (listening to the low number of chemical signals) while the high-speed feedback is learned by neurons influencing each other in the prograde direction as emergent behaviour resulting from this primitive wetware (the only feedback being through axons literally feeding back as opposed to feedforward networks); or

B) local feedback: if the high-speed feedback is hardcoded wetware too, i.e. retrograde signalling of adjoint derivatives across the synapse, which would mean that the reverse accumulation automatic differentiation we use is closer to biology than currently accepted.

For example in A) suppose there is a low bandwidth (low frequency) feedback reward signal, say blood sugar rising after eating a given sweet for the first few times, then the shortest path between the taste buds and the chewing and swallowing motor neurons would improve their weights for intake, a first level anticipation, but then later seeing or touching the sweet or actively seizing one might cause the previously trained neurons to train the newly recruited signal paths by some other neurotransmitter reward by anticipation.

Fast-approach-but-imprecise adaptation vs Slow-approach-but-precise fine-tuning ===

Another facet is that we currently have "1 training regime" for digital neural networks, with which I mean: we use gradient descent both to 1) adapt the weights from a totally randomized initial state to a somewhat acceptable state, and then after a nonexistent pause, 2) to continue improving the weights near the top of the hill (or bottom of valley...) of the scoring landscape. We have no guarantee that nature uses a single regime, let me concoct a hypothetical (and thus improbable) guess: perhaps it uses gradient descent to get the weights approximately where they should be, but then a per-synapse lock-in amplifier correlates (multiplies over time 2 signals while summing) the positive / negative feedback with its own positive or negative variation on the weight, smoothes this product-sum correlation signal (low pass filtering) to get a very high precision feedback signal which has higher precision than a single instantaneous "sample" of feedback (local or global). Throughout the brain synapes some would be closer to the instantaneous feedback signal regime (in response to new lessons. or for short term memory), while others would be closer to the LIA regime, for longer term memory and / or precision. That would be trillions of lock-in amplifiers in a single human brain...

EDIT: in fact SGD stochastic gradient descent can already be seen as a global LIA mechanism, but with all weights / synapses using the same low pass filter. In theory we could give each synapse / weight not only a weight but also each their own timescale (or example count scale, the tau in (alpha) and (1-alpha) multiplier factors when expressed as a filter), and adapt the timescale depending on the local feedback signal (adjoint derivative) in the backpropagation algorithm. To vary the weight just use weight times (1 + 0.001 times the randombit bit r), and multiply it with the feedback signal (the component of the usual gradient, i.e. the derivative that corresponds to this synapse / weight). A similar trick could be used to vary the timescale. One could also hardcode the timescale for certain synapses as a hyperparameter to forcibly locate short term vs long term memory pathways.

EDIT2: As an example of the function of weights that would settle (or were forced) at long vs short timescales at low or high levels:

low level, short term: adapting to reverb entering or leaving a church, adapting to noise when entering or leaving a party,

low level, long term: associating spectral peaks among frequency bins (the fundamental and higher harmonics for sounds of different timbres would remain in roughly the same place, regardless of background noise)

high level, short term: unimportant small talk, or important but quickly dealt with information (did I already pay this drink or can I walk away?)

high level, long term: what is my pin code?

inimino · on Sept 3, 2019

> Another facet is that we currently have "1 training regime" for digital neural networks, with which I mean: we use gradient descent

Speak for yourself! There are plenty of other ways. Gradient descent is just currently in fashion.

DoctorOetker · on Sept 3, 2019

sure there are other means of training, I was highlighting that nearly all approaches use only one improvement approach during training (gradient descent, genetic algorithms, ...) and if multiple are used the whole model is using the same balance of multiple approaches.

inimino · on Sept 3, 2019

Ah then agreed. My take is if you have a feedforward network, add time delays, because synapses aren't running in lockstep, and add connections arbitrarily with any two neurones, only limited by the physical distance between them, and then add in every kind of diffuse effect of neurotransmitters and all the brain chemicals we haven't discovered yet, and then figure out how to train it... you still won't have a human brain equivalent, because it is harder than that. But you might get something that can learn! All you need is a dopamine reward button, and some way to whack it with a stick every now and then.

I think we're more likely to create AGI than to understand how the brain works in our lifetimes.

j88439h84 · on Sept 3, 2019

> When designing a vector graphic one can enable "snap to grid", similarily at some point we will have to "snap to makes sense"

I love this idea, thank you for sharing it.

DoctorOetker · on Sept 3, 2019

My post wasn't very coherent in hindsight, but I intended to more explicitly suggest that I predict that at some point we will replace / augment supervised training for abstract reasoning tasks with "unsupervised" feedback from provers or verifiers (which can objectively verify if a conclusion makes sense). From the perspective of the neural network it is supervised by the prover or verifier software, but from the perspective of us humans we don't choose when a proof checks out or not in a formal system, mathematics does.

EDIT: if you are into ML and like this idea, certainly check out MetaMath, the book is very accessible, and the author (Norman Megill) is extremely friendly and helpful. It takes perhaps a few days to a week to learn, study, and reimplement the verifier. One does not at all need to actually study the whole of the set.mm database (i.e. all the theorems and proofs) to implement the snap to makes sense I propose above. It simultaneously gets rid of the expensive labeling for correct vs incorrect proofs on one hand, and is a path to self-explaining AI on the other. I can not at all exclude that "soon" AI will be proving math conjectures faster than humanity can conjure them. That would be quite an interesting world! At first it would be trained on generating proofs for propositional logic statements (the challenges could be generated by a prover doing a "random walk"), then first-order logic / set theory / numbers.

inimino · on Sept 3, 2019

If it can't come up with new ideas and also get bored, it will suck at finding proofs!

DoctorOetker · on Sept 3, 2019

why do you assume this is necessarily true? consider how word2vec results in analogies among discrete symbols, lots of theorems, proofs, functions,etc are analogies of previous ones.

inimino · on Sept 3, 2019

I'm not assuming! I know it.

Word2vec results in similarity scores you can use to find analogies. Great! Now how do you know which analogies to look for? Which ones to remember? Which ones to pursue? You need some kind of intuition for where to look. So you need (1) a way to move around the search space and (2) a way to know when to backtrack, or when a line of inquiry is unfruitful, and this last is precisely the capacity for boredom. If you can't come up with new ideas and you can't get bored, you'll suck at finding proofs just like you'll suck at math or problem solving in general. There's also a danger in getting bored by the wrong things. So how do you develop this boredom intuition?

We choose what to think about, but until we make conscious machines, we are probably the only creatures that have to make this choice.

DoctorOetker · on Sept 3, 2019

you don't use the similarity score to extract analogies from word2vec, it's offset vectors, linear relations the infamous "king"-"man"+"woman"=~"queen" (with reasonably low perplexity)

extracting the analogies is not that hard (a naive brute force is looping over combinations of 4 words and testing how close the relationship holds), but more importantly, one doesn't need to extract the analogies, the neural networks utilize these analogies implicit in their embedding.

I have never seen the boredom of neural networks investigated, the closest concept that comes to my mind is surprisal, which is widely understood since Shannon & information theory...

inimino · on Sept 3, 2019

Right, so the point is, you have your vectors and now you have a way to find 4-tuples representing analogous sets of words. Great! You built an analogizing machine! Now what do you do with the analogies?

Neural networks can't get bored because they don't decide what to think about. They are like a total functional programming language with no control flow constructs. This is why they always give an answer in the same amount of time, regardless of the input vector.

Surprisal is only a measure of information (given a distribution) and is only distantly related to what I'm talking about. However, if a neural network could choose to think more, choose to get new data and reconsider, choose to go back and look at that one from a couple minutes ago... then you'd also want it to have the ability to get bored. AKA to recognize when it is spending resources on an unprofitable line of inquiry.

nl · on Sept 3, 2019

You can see some examples of this heuristic thinking in "Thinking, Fast and Slow": https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow#Two_sy...

PeterStuer · on Sept 3, 2019

Great article. Should be required reading for test designers as humans also tend to pick up on the implicit heuristic clues that bypass the need for knowledge and understanding as prep schools teach "don't know the answer? These 5 simple rules will improve your guess"

tonypace · on Sept 3, 2019

The standardized test organizations give briefings to teachers on the most common Hans cues for students. In order, they are differing answer length, answers that mirror the question's grammar, and the infamous letter C.

PeterStuer · on Sept 7, 2019

I had to Google for the 'infamous letter C'. In case others were wondering, it seems to stem from folk wisdom still perpetuated that for a blind guess the third answer when given 4 or more options in a multiple choice test is more likely to be correct than the others.

The linked article suggests this does not hold (for the ACT), but doe suggest picking a single letter and sticking with it for all blind guesses would outperform a purely random guessing strategy.

https://blog.prepscholar.com/most-common-answer-on-act

js8 · on Sept 3, 2019

I have said it before and I will say it again - the problem is that humans are MORE than just pattern recognizers. We build models that we are trying to make logically consistent; while the function that provides pattern recognition doesn't have to be.

To see that better, consider what I call "simplified Chinese room". It's a variation on a traditional Chinese room, where inside the room, there is only a pattern recognizer, which basically will match the input to arbitrarily many inputs (but not all possible) it learned before and chooses the output for the best match.

Now imagine I want to train this "simplified Chinese room" on solving satisfiability problem. Because in that problem, an arbitrarily small change in the input (introducing a contradictory clause) can completely change the output. It is therefore impossible, I believe, to learn the concept of satisfiability by just using pattern recognition (storing and comparing, according to some metric, previously seen inputs and corresponding correct outputs). Instead, you need to build a mental model which is internally self-consistent with these example pairs.

visarga · on Sept 3, 2019

> Instead, you need to build a mental model

<rant>That's where graph neural nets come into place. They can learn relations, scaling to a large number of objects. A traditional approach would have to learn all possible combinations, hitting the combinatorial explosion. Graph neural nets can solve problems such as shortest path, sorting and dynamic programming. I think in the future if we are to get closer to human level we need graphs as the intermediate representation. Graphs could represent the objects in an image/phrase and their relations, then answer about the attributes of an object, the relation between two objects or classify the graph itself. All simulators are evolving graphs as well, and code/automata could be represented and executed as a graph. The transformer could be considered an implicit graph where the adjacency matrix is computed from the nodes at each iteration. The closest to AGI in my view would be model based RL implemented with graphs.</>

ozymandias12 · on Sept 3, 2019

Amazing rant.

But what bothers me is that the human animal is trying to create an artificial human mind, the most amazing piece of meat we have laying around.

If we could create the reasoning of a dog first, to then upscale to more complex logic, I'd be more confident research is going places (as in, we have biological evidence building blocks to understand first).

DoctorOetker · on Sept 3, 2019

Deciding consistency is a ridiculously hard next bar to set though, lots of NP-hard problems have very useful human-crafted heuristics nowadays. Similarily AI-crafted heuristics should be more than possible.

js8 · on Sept 3, 2019

I agree, but at the same time you're kind of missing the point. Adding more and more examples to the pattern recognizer is like adding more and more epicycles (for dealing with special cases) into the theory of celestial object movement, where a more general and simpler theory will do a better job.

But it's true we don't precisely know how to "find and represent the better theory", if it's at all possible (most likely not) and what the exact trade-off is.

Also note that my argument doesn't necessarily rely on computational difficulty of SAT - even if we limit to some "simple" subset of SAT instances, we might see that the pattern recognizer is unable to learn these problems correctly.

lostmsu · on Sept 3, 2019

And you are wrong. Consider a logic task of deciding !!!!!....!!!!!false with 100-200 !s. Can you solve it without iteratively applying pattern recognition? What if it is 1kk-2kk of !s?

js8 · on Sept 3, 2019

I didn't claim that pattern recognition is useless, I claimed that by itself is not enough. And you can see that your model uses it in the calculation iteratively, that is, arbitrary number of times. It is not just giving out the result that is closest to the database of stored results.

lostmsu · on Sept 3, 2019

Yet my argument exactly is that the logical abstraction you were talking about is a relatively simple pattern, and what is beyond it less important.

The author is right that what we do for NLP is not enough for reasoning. But it does not mean that we are too far from it.

Even in your variation of the room experiment, if the person in it could reply something along the lines of "here is what I think is a part of solution", which you simply had to feed back to eventually get the whole thing, would you claim it is a weak abstraction?

runT1ME · on Sept 3, 2019

Wonderful article that both talks about overall problems and at the same time gives concrete easy to understand ways to test your model.

miltondts · on Sept 3, 2019

I suspect a human trained exactly like this and with zero previous knowledge would have have the same or worse performance. What is the surprise here? That deep learning is not magic? Do people working in A.I. not know how animals and humans learn?

draw_down · on Sept 3, 2019

> Without getting into a Chinese Room argument about what it means to understand something,

Yep, probably best to avoid. I still don’t think I’ve seen a convincing rebuttal.