Exploring Weight Agnostic Neural Networks (googleblog.com)
179 points by lamchob on Aug 29, 2019 | 64 comments


Unpopular quote from my image and video processing professor - “The only problem with machine learning is that the machine does the learning and you don’t.”

While I understand that is missing a lot of nuance, it has stuck with me over the past few years as I feel like I am missing out on the cool machine learning work going on out there.

There is a ton of learning about calculus, probability, and statistics when doing machine learning, but I can’t shake the fact that at the end of the day, the output is basically a black box. As you start toying with AI you realize that the only way to learn from your architecture and results is by tuning parameters and trial and error.

Of course there are many applications that only AI can solve, which is all good and well, but I’m curious to hear from some heavy machine learning practitioners - what is exciting to you about your work?

This is a serious inquiry because I want to know if it's worth exploring again. In the university AI classes I took in the past, I just got bored writing tiny programs that leveraged AI libraries to classify images, do some simple predictions, etc.


ML is more like growing crops than it is about "designing stuff". Growing crops is slow, and you don't know beforehand what the result will be. However, you can still throw a lot of science at growing crops ("plant breeding" is a science), and the same holds for engineering.


Or you throw ML at growing crops, much like they throw ML at ML these days (https://ai.googleblog.com/2017/05/using-machine-learning-to-...).


This might be pedantry, but there are far sillier things to apply ML to than ML itself, despite the initial sound of the thing.


I'd highly recommend watching this podcast interview between Lex Fridman and the creator of fast.ai, released yesterday: https://youtu.be/4CTDdxfSXF0

He covers a lot of relevant and interesting topics, including how he tries to make it less of a black box when designing their courses, and also how it has the potential to confer increased rather than decreased insight into what's going on in a dataset.

I don't personally know much about ML, but I think that even though there will probably still be an opaque aspect in many cases for a while to come, immense value will still be continually gleaned, as long as people are aware of the limitations. If you accept something is a black box and don't oversell it, a black box is better than no box.

All of our own brains are far more of a black box than any deep learning model, in many capacities. But we still use them daily for meat-machine learning, to great success, and can still tune the parameters a bit to improve outcomes, even if we very often don't really understand exactly what we're tuning or why it seems to cause certain effects for some brains (or why it doesn't have those effects for other brains). Consciousness may be the biggest black box of them all, but here we are all talking and making nearly non-stop use of it.

I agree it's very important to try our hardest to reach a deeper understanding, but it's kind of like psychiatry vs. neuroscience, or experimental quantum physicists who "shut up and calculate" vs. theoretical quantum physicists who actually want to know what's really going on here at the most fundamental level beyond the useful black box of quantum behavior. While we're trying to solve the hard problems of deep understanding, we can make practical use of what we have in the meantime. We need both kinds of fields and people.

If future AI architectures and ideas lead to some degree of convergence with a biological brain, I wonder if the black box problem might become amplified. Maybe one part of the solution is to not use the brain as the model to aspire to, and to eventually seek out alternative avenues to higher, and eventually general, intelligence? (Or maybe I'm completely talking out of my ass, because I'm not at all a researcher or practitioner. I'd appreciate any input from experts.)


For me, the exciting part is to understand how the "black box" can learn and to find a representation of the data that makes this box learn.

For instance, I've been working on user profiling, and it's been a challenge to find which features, and in which representation, allow the model to learn. It's fantastic when you make a little change to a feature (for instance, use the median instead of the mean) and your model suddenly gets +5% accuracy.
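
To make that concrete, here is a minimal sketch of such a representation change (the column names and numbers are made up for illustration, not from my actual project):

    import pandas as pd

    events = pd.DataFrame({
        "user_id":        [1, 1, 1, 2, 2],
        "session_length": [30, 35, 600, 20, 25],   # one extreme outlier for user 1
    })

    mean_feature   = events.groupby("user_id")["session_length"].mean()    # dragged up by the outlier
    median_feature = events.groupby("user_id")["session_length"].median()  # robust to it

    print(mean_feature.round(1))   # user 1: 221.7, user 2: 22.5
    print(median_feature)          # user 1: 35.0,  user 2: 22.5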

The field in the real world is not as simple as making an API call. The data in the wild is really complicated. Getting value from this data is the real challenge.


"replace the mean with median"

This illustrates the black-box aspect of it. You changed something and the results are affected, but you don't know why. The median has a built-in implicit filtering (it's not affected by extreme outliers like the mean is), so it could simply be that you needed to filter your inputs. But you won't know, because... black box.


and the results are affected but you don't know why

Well, that's just your assumption without knowing the exact problem and I think you are missing my point.

You can approach data preparation by randomly changing things, and maybe you will get some interesting results, but I promise you will fail many, many times. The other way is to know what it means to swap the mean for the median, for instance (as in the random example I mentioned), and I promise you will find better solutions.

The idea is not just "change and test" and see what happens. The interesting part is to understand how the model uses your representation and why one is "better" than another.


"what is exciting to you about your work?" Seeing the black boxes solve problems I know I could not solve manually. If you go into industry, expect to be spending the vast majority of your time wrangling data, integrating the black boxes into other software, and apologizing repeatedly to managers because you have no idea how long it will take to make your black box work, if it ever does. Trying to make plans around AI based software is an exercise in futility.


I've worked with black boxes (special sauce) which didn't actually do anything (useful).

Free career advice:

Don't be the boy in the parable of the emperor's clothing. You will be punished. Just smile and nod, say vaguely positive stuff, get your job references, and quickly find a new gig.


I'd suggest you look at Probabilistic programming languages: https://en.m.wikipedia.org/wiki/Probabilistic_programming

As someone who, like you, finds ML boring, I think PPL is much more fascinating, because it brings the programming back into AI. It's more like the logical evolution of logic-based rule systems, adding Bayesian probabilities to them.


I’m most excited about what is now being called scientific machine learning, ie machine learning models explicitly structured to learn interpretable models that can be used to understand some scientific domain better. For example, I’m starting to work on using graph RNNs to study dynamical behavior of reinforcement learning agents and how it might give us insights into psychiatric disorders.


I'm interested in reading more about this area of work. Can you share your project page, if it exists, or any foundational papers in this space?


My inspiration comes from this: https://arxiv.org/pdf/1809.06303.pdf

There's a long history of using differential equation models to model high-level brain function (see also neural mass models and neural field models) but I think there are advantages to using discrete time approximations such as neural networks in an RL setting to investigate how the dynamics (e.g. attractor states, etc) map onto behavior.


Your professor was quoting/paraphrasing Alan Perlis in reverse.

Epigram #63

63. When we write programs that "learn", it turns out we do and they don't.

--

SIGPLAN Notices Vol. 17, No. 9, September 1982, pages 7 - 13.


Nice reference, but I think that claim is the opposite of the professor's.


you are correct.


I’m not a heavy practitioner but as a robotics engineer being able to use existing algorithms to perform previously difficult tasks is exciting. I’m also hopeful that the coming decades will bring a lot more to robotics software as machine learning research continues.

One thing I’ve been learning is that the black-box nature of machine learning algorithms is partially a myth. A lot of tools have been written to help explain models. However, I’m a mere novice and student, so that’s just something I’ve heard. Would love it if a skilled practitioner chimed in.


There has been a lot of research into understanding neural networks and making them less of a black box. If you classify cat vs. dog videos on YouTube, it doesn't matter if you make a mistake every now and again. But if you want to build a self-driving car or make a medical diagnosis, you had better be able to explain why your network made a certain decision.


I've been hearing this quite often in the last few years, but I'm not sure what kind of explanation you mean.

The x-ray image was misdiagnosed because... What type of thing should come here?

... it didn't look like the other class. ... it didn't have this weird smudge thing on the top left in which case usually there should be a little hazier blob in the middle, except when the pointiness of the thing that is to the right of the brightest blabla...

You get the idea, my description is even exaggerating the nameability and describability of these structures. You'd get a long and complex description at best, because simple models don't work for pattern recognition. But even if you made sure to just use well understood features like edge thickness, angles, sizes of connected components etc, how would a boolean formula of a hundred such terms be helpful in court or wherever you want to use these explanations?


For example, there is the concept of visual attention. You plot which areas of the image the model pays attention to when making its decision.
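
For instance, here is a minimal sketch of gradient-based saliency, one of the simpler "where did the model look" techniques (assuming PyTorch/torchvision; the network here is randomly initialized just to keep the snippet self-contained, in practice you'd use your trained model):

    import torch
    from torchvision import models

    model = models.resnet18().eval()   # randomly initialized here; use a trained model in practice

    image = torch.rand(1, 3, 224, 224, requires_grad=True)   # stand-in input image
    score = model(image)[0].max()      # score of the top class
    score.backward()                   # gradient of that score w.r.t. the input pixels

    saliency = image.grad.abs().max(dim=1)[0]   # 1 x 224 x 224 "where it looked" map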


Does the FDA require you are able to explain how drugs work? Or do you just have to show their efficacy and safety in trials?


One thing that is exciting to me is that machine learning is the next level of abstraction in our tools for thought, giving us ways to deal with fuzzy and ultra-high-dimensional problems.

Machine learning can be viewed through optimization, probability and information-theoretic lenses, and each of them will give you an understanding of a different aspect. One recent "click" I had in my head came from debating with colleagues who are really strong in optimization: they routinely define values, sets, functions etc. as the end result of an optimisation problem. And this makes total sense.

Going down to maths that most people might be familiar with from high school, think about how you define a line in space, or a plane. You write down "y = f(m) = a + m*x" for a line (which you could read: "to find a point m times the length of x away from the anchor point a of the line, start at the anchor and move m times along the vector x"), and "P = {x : dot(w, x - x0) = 0}" for a plane (read: the set of all points x for which the dot product is 0). These are two different forms of describing the objects: one is expressed as a function, using the constraints placed on the object to move along it; the other is simply a concise way to express all the constraints. You can move from one form to the other for both objects, and finding these kinds of connections increases your understanding of the objects and of how their constraints define their structure.

Now, optimisation problems are similarly defined by their constraints, but they are much more flexible and connected to reality. We describe constraints in the hierarchy of convex (LP, QP, SOCP, SDP, conic) and nonconvex (aka "fucking difficult to deal with") constraints, and then develop optimisers that attempt to fulfill these constraints as best as possible given some data. If it's solvable, then we have a way to reason explicitly about very complex properties of the data and the mathematical objects contained in it - which can be used to solve actual problems, like finding the Pareto frontier of a decision space, or laying out a UI nicely.
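
A small sketch of that "object defined as the end result of an optimisation problem" idea, reusing the plane example: the projection of a point onto the plane, written once in closed form and once as a constrained minimisation (the numbers are arbitrary, and scipy's generic solver is just one convenient stand-in):

    import numpy as np
    from scipy.optimize import minimize

    w      = np.array([1.0, 2.0, -1.0])   # plane normal
    anchor = np.array([0.0, 1.0, 0.0])    # a point on the plane (the x0 above)
    q      = np.array([3.0, 3.0, 3.0])    # query point

    # closed form: project q onto P = {x : dot(w, x - anchor) = 0}
    proj = q - (w @ (q - anchor)) / (w @ w) * w

    # the same point, defined as "argmin ||x - q||^2 subject to x lying in P"
    res = minimize(lambda x: np.sum((x - q) ** 2), x0=q, method="SLSQP",
                   constraints=[{"type": "eq", "fun": lambda x: w @ (x - anchor)}])

    print(proj, res.x)   # both should agree up to solver tolerance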

Machine learning takes this and pushes it to its extreme, finding ways to navigate extremely high-dimensional spaces and reason about the objects contained therein. Things like word vectors, latent representations and the "natural image manifold", finding ways to discover the "separation" of classes by building generative models and measuring distances in the latent space, visualisation techniques like curch plots, t-SNE... all of this extends our ability to reason about and understand data and ultimately our world.

That's one thing that excites me, at least.


Previous discussion (about the actual research article at https://weightagnostic.github.io/ rather than the blog post):

https://news.ycombinator.com/item?id=20160693


How is it different from pruning a neural network?

It seems you could train the weights of a state-of-the-art NN, then quantize it, then prune it. That would remove some weights of the NN, and then all the remaining weights are set to the same value. Isn't training then pruning more efficient than using an architecture search algorithm?


then all the remaining weights are set to the same value

During quantization, weights are set to different values, in the extreme case just 2 different values (binarization). In this case, they are using multiple activation functions to provide various paths for signals to be modified (effectively playing the same role as weights).
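
As a minimal sketch of that extreme case (roughly what binarization schemes such as BinaryConnect do; the numbers are purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=(4, 4))               # stand-in weight matrix

    alpha = np.abs(w).mean()                  # one scale factor for the whole layer
    w_bin = np.where(w >= 0, alpha, -alpha)   # every weight becomes +alpha or -alpha

    print(np.unique(w_bin))                   # exactly two distinct values remain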


At the risk of broad oversimplification, pruning trains and then does an architecture search. This does an architecture search and then trains.


No, here the architecture search is the training.


I wrote a series of Markov chat simulators as a teenager. Often I used a simpler algorithm which ignored the probability weights (all out-links, once learned, were given equal probability). These versions performed subjectively as well as, if not better than, the versions which tracked the weights of links. I'm not surprised, therefore, that weight agnostic neural networks can work too.
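
A minimal sketch of that weight-ignoring variant (simplified, not my original code): only the set of observed successor words is kept, and the next word is drawn uniformly from it, with no transition counts at all.

    import random
    from collections import defaultdict

    def train(text):
        successors = defaultdict(set)
        words = text.split()
        for a, b in zip(words, words[1:]):
            successors[a].add(b)          # keep out-links only, no counts
        return successors

    def generate(successors, word, length=10):
        out = [word]
        for _ in range(length - 1):
            options = successors.get(word)
            if not options:
                break
            word = random.choice(sorted(options))   # uniform over learned out-links
            out.append(word)
        return " ".join(out)

    chain = train("the cat sat on the mat because the cat was tired")
    print(generate(chain, "the"))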


I think it may not be a great comparison. N-grams (of words) of human speech/writing are way more deterministic than the kinds of things ML usually tries to tackle, I think. If you write the word "because", then "of", "the", or some pronoun are all extremely safe bets for the next word, regardless of their recorded probabilities. I imagine you could also totally randomize the probabilities and not see any issues.

But I'm no expert and hardly even an amateur, so maybe it is a similar kind of thing here with ML. And I know randomized optimization is a big thing in ML, though I'm not sure to what extent that could be analogized with randomizing Markov model probabilities.


The analogy given in the article is interesting. Some organisms perform certain actions even before they start to learn. I myself have seen some animals start running immediately after birth. A lower number of parameters (shared parameters) could also be thought of as less complexity and hence lower processing power requirements, which implies faster training. Phew! Too much similarity.


https://github.com/google/brain-tokyo-workshop/tree/master/W... To me, this is the really interesting part of the article. NEAT (NeuroEvolution of Augmenting Topologies) is an algorithm for evolving neural network topologies with a genetic algorithm. For those who are looking to implement the algorithm from scratch, see http://nn.cs.utexas.edu/downloads/papers/stanley.ec02.pdf for hours of fun.


This seems like it has the potential for massive efficiency gains and maybe could help with better generalization if the much simpler networks could more easily be reused or recursed or something.


How is this different from genetic programming?


The authors draw from traditional genetic algorithms (NEAT http://nn.cs.utexas.edu/downloads/papers/stanley.ec02.pdf ) but this appears to be one of the first papers (at least recently) where proposed architectures are not trained but rather given random weights. The authors here try to qualitatively distill the role of architecture (vs optimization) in neural net research.

The work here is in a similar vein as the Lottery Ticket Hypothesis ( https://arxiv.org/pdf/1803.03635.pdf ), which found that deep nets (for vision) contain discriminative sub-networks at initialization time (due to random initialization), before training ever starts.

While the authors of this work on architecture search say they hope to inspire the discovery of new architectures, a more immediately striking result of their work is that they get functioning systems from doing something “stupid” (i.e. not optimizing weights).


Genetic programming tries to do almost the same thing: i.e. generate random programs/expression trees etc. and then select the best one. NNs give you two advantages over that: you can optimize weights, and you can do it at scale with current hardware. Without those two I can't see the difference.


How was early Machine Learning different from statistics?

New names make things exciting for people to pick up. Who wants to estimate a multinomial regression when you can learn a shallow softmax-activated neural network!

It's all about creating hype.
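
To be concrete about that equivalence (a sketch only; sklearn's multi_class="multinomial" option is one way to fit it, and the data here is random filler): a softmax layer with no hidden units is exactly multinomial logistic regression.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.random.default_rng(0).normal(size=(200, 5))        # random filler data
    y = np.random.default_rng(1).integers(0, 3, size=200)     # 3 classes

    clf = LogisticRegression(multi_class="multinomial").fit(X, y)

    def softmax_net(X, W, b):                  # the "shallow neural network" view
        z = X @ W.T + b                        # one dense layer, no hidden units
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    p = softmax_net(X, clf.coef_, clf.intercept_)
    print(np.abs(p - clf.predict_proba(X)).max())   # ~0: same model, different vocabulary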


>> How was early Machine Learning different from statistics?

Some of the very early work in machine learning, in the 1950s and '60s, was not statistical. The first "artificial neuron", the McCulloch & Pitts neuron from 1943, was a propositional logic circuit. Arthur Samuel's 1952 checkers-playing programs used a classical minimax search with alpha-beta pruning.

Machine learning in the '70s and '80s was for the most part not statistical, but logic-based, in keeping with the then-current trend for logic-based AI. Early algorithms did not use gradient descent or other statistical methods and the models they learned were sets of logic rules, and not the parameters of continuous functions.

For instance, a lot of work from that time focused on learning decision lists and decision trees, the latter of which are best remembered today. The focus on rules probably followed from the realisation of the problems with knowledge acquisition for expert systems, that were the first big success of AI.

You can find examples of machine learning research from those times in the work of researchers like Ryszard Michalski, Ross Quinlan (known for ID3 and C4.5 and the first-order inductive learner FOIL), (the) Stuart Russell, Tom Mitchell, and others.


How was early Machine Learning different from statistics?

I'd argue: in two ways.

First: ML's algorithmic focus. Just about anything in modern AI/ML works because it uses compute at extreme scale. For example neural nets seem to work well only when trained with huge amounts of data. Statisticians lacked the background to make this happen.

Second: most work in statistics assumed that data was generated by a given stochastic data model. In contrast, ML has been using algorithmic models and treating the data-generating mechanism as unknown. In most real-world situations, the mechanism is unknown.

It's not just hype. Statistics was stuck in a local optimum, and it was ML's focus on algorithms, data structures, GPUs/TPUs, big data, ... together with the jump into 'weird' data (e.g. the proverbial cat photos), that propelled ML ahead of statistics.


> Statistical Modeling: The Two Cultures (2001), Breiman

> There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.

http://www2.math.uu.se/~thulin/mm/breiman.pdf


There are completely non-statistical learning algorithms (many, at that), which is part of why the distinction is needed. Stats are definitely crucial in parts of the ML domain, but not everywhere. Another related reason is one of focus: ML doesn't care how it gets to a result, it only concerns itself with getting the result. Stats can be a tool to get there, along with many other things.


You can train the network afterwards with backprop.


I'm pretty sure this was posted a while back (maybe a month or two)?


Yes, with some insightful and informative comments:

https://news.ycombinator.com/item?id=20160693


Is each architecture given one set of random weights?

Or is the architecture of the net tested against a bunch of random weights so that it performs well independently of the weights?


Each architecture is tested multiple times against different samples of the shared weight value.

From the paper (a rough code sketch of this loop follows the list):

(1) An initial population of minimal neural network topologies is created

(2) each network is evaluated over multiple rollouts, with a different shared weight value assigned at each rollout

(3) networks are ranked according to their performance and complexity

(4) a new population is created by varying the highest ranked network topologies, chosen probabilistically through tournament selection
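
Roughly schematically, not the authors' implementation (evaluate(), minimal_topology(), complexity(), vary_topology(), tournament_select() and env are all placeholders standing in for the paper's components):

    WEIGHT_SAMPLES = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]   # shared weight values to try

    def avg_score(net, env):
        # (2) evaluate one topology over several rollouts, each rollout using a
        # single shared weight value for *every* connection in the network
        returns = [evaluate(net, env, shared_weight=w) for w in WEIGHT_SAMPLES]
        return sum(returns) / len(returns)

    population = [minimal_topology() for _ in range(64)]           # (1)
    for generation in range(100):
        ranked = sorted(population,                                # (3) rank by
                        key=lambda net: (avg_score(net, env),      # performance,
                                         -complexity(net)),        # then simplicity
                        reverse=True)
        population = [vary_topology(tournament_select(ranked))     # (4) vary the
                      for _ in range(len(population))]             # best topologies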


What is a "shared weight"?


Usually in neural networks, each neuron has its own weights, with different values that are tuned during training.

Shared weight here means that every neuron in the network shares the exact same weight value.


Neither. The architecture performs well independently of the weights, BUT all weights must be the same, e.g. the same network works well when all weights are 5.0 or when all weights are -3.0.


Hmm... that collides with what patresh said.


No, it doesn't. Shared weight means all weights are the same value, as I originally mentioned.


how is this different from boring old evolutionary algorithms?

In my opinion the big breakthrough that enabled optimization and machine learning was the discovery of reverse mode automatic differentiation, since the space or family of all possible decision-functions is high dimensional, while the goal (survival, reproduction) is low dimensional. Unless I see a mathematical proof that evolutionary algorithms are as efficient as RM AD, I see little future in it, and apparently neither did biology since it decided to create brains.

It's not an ideological stance I take here (of nature vs nurture).

For simplicity, let's pretend humans are single-celled organisms. What does natural selection exert pressure on? Our DNA code: both the actual protein codes and the promoter regions. I claim that variations in the proteins are risky (a modification in a protein-coding region could render a protein useless) while a variation in the promoter regions is much less risky: altering a nucleotide there would slightly affect the affinity modulating transcription, so the cell would behave essentially the same but with different threshold concentrations. Think of the continuous parameters that describe our body (assuming the same nurture, food, etc.): some people are a bit taller, some people a bit stronger, etc. So how many of these continuous parameters do we have? On the order of the total number of promoter regions in the DNA of the fertilized egg: both in the human DNA and in one mitochondrion (assuming there isn't a chemical-signal addressing and reading and writing scheme for, say, 10 mitochondria)...

EDIT: just adding that for a certain fixed environment, there are local optima (and a global optimum) of the affinity values for each protein, so that near a local optimum the fitness is roughly shaped like -s(a-a_opt)^2, where s is the spread and a_opt the locally optimal affinity value. In other words, it is not the case that "better affinity" means fitter, not at all: a collection of genomes from an identical environment will hover around an affinity sweet spot.

According to Wikipedia [0], that would result in:

about 2x 20412 "floats" for just protein-coding genes

about 2x 34000 "floats" when also including the pseudo-genes

about 2x 62000 "floats" when also including long ncRNA, small ncRNA, miRNA, rRNA, snRNA, snoRNA

these "floats" are the variables that allow a species to modulate the reaction constants in the gene regulatory network, since natural selection can not directly modulate the laws of physics and chemistry, and modulating the protein directly instead of the promotor region affinities / reaction rates risks disfunctional proteins...

so my estimate of an upper limit on the number of "floats" in the genetic algorithm is ~120,000 (and probably much less if not each of the above has a promoter region).

that's not a lot of information if we think about the number of synaptic weights in the brain, and many of these "floats" are shared in utilization by the other cell types besides neurons.

I consider the possibility that the sperm cell, egg cell, or fertilized egg cell performs a kind of POST (power-on self-test) that checks for some of the genes, although simply reaching the fertilized state may be enough of a self-test, so no spontaneous-abortion test may be needed (to save time and avoid resources spent on a probably malformed child).

[0] https://en.wikipedia.org/wiki/Human_genome#Molecular_organiz...

EDIT2: regarding:

>This makes WANNs particularly well positioned to exploit the Baldwin effect, the evolutionary pressure that rewards individuals predisposed to learn useful behaviors, without being trapped in the computationally expensive trap of ‘learning to learn’.

The computationally expensive trap of having to 'learn to learn' could end up being as mundane as a low number of hormones to which neurons in the brain globally or collectively respond, which enables learning by reward or punishment, and from then on anticipating reward or punishment, and our individual end goal stems from this anticipation, and anticipating the anticipation etc...


>Unless I see a mathematical proof that evolutionary algorithms are as efficient as RM AD, I see little future in it, and apparently neither did biology since it decided to create brains.

In the biological context the complexity is shifted away from the actual selection algorithm and onto the "scoring function." Although the filter of reproduction is relatively simple [1], the reason why the organism was fit and could ultimately reproduce is very complex. The brain's "topology" is the result of selection pressure favoring adaptations that improve fitness by dynamically adapting to relevant patterns found in the environment. The brain attempts to accurately model important aspects of the complex environment it's challenged with to increase fitness.

[1] Nothing in biology is ever actually simple, even though individuals undergo fitness based scoring, the actual notion of individual is arbitrary. In the case of bees most of the individual bees in the colony are sterile, which is obviously bad for fitness of the "individual." However a few individuals reproduce, so all the sterile worker's fitness is shifted onto the queen/drones through the concept of inclusive fitness. This indirect selection also applies to humans, who undergo inclusive fitness through their siblings. This can even be generalized to your individual cells/organs which function as a colony supporting the gonads which actually undergo direct selective pressure.


with efficiency I meant computational efficiency:

consider the task of computing a gradient at a point p0 = <x1, x2, x3, ..., xN> in an N-dimensional space.

the naive approach was for a long time: compute the value of the score at p0, then for each coordinate compute the score for the same point but shifted a delta in the direction of that coordinate, i.e. the i-th component is computed as:

component_i = (score( <x1, x2, ..., xi+delta, ..., xN> ) - score (p0))/delta

that's N+1 evaluations or trials of the score to compute the final gradient <component_1, ..., component_N>

notice how reminiscent this is of natural selection: the average of the last generation p0 is used to generate ~N trials, which then result in the average of the next generation shifting somewhat.

compare reverse-mode automatic differentiation to calculate a gradient: one forward pass of the computation with one backward pass...

I am not complaining about the complexity of the fitness or scoring function, I am complaining about trial and error approaches, when we have discovered a rocket for differentiation!
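
To make the cost contrast concrete, a small sketch (jax.grad is just one convenient stand-in for reverse-mode AD, and the score function here is arbitrary):

    import jax
    import jax.numpy as jnp

    def score(p):                         # arbitrary smooth score function
        return jnp.sum(jnp.sin(p) * p ** 2)

    def fd_gradient(f, p0, delta=1e-5):   # the "naive approach" described above
        base = f(p0)                      # 1 evaluation
        comps = []
        for i in range(p0.shape[0]):      # + N evaluations, one per coordinate
            comps.append((f(p0.at[i].add(delta)) - base) / delta)
        return jnp.stack(comps)

    p0 = jnp.linspace(0.1, 1.0, 10)
    print(fd_gradient(score, p0))         # N+1 = 11 calls to score
    print(jax.grad(score)(p0))            # one forward pass + one backward pass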


Ah, in that sense it's almost certainly more efficient. Biological selection is undirected so progress toward the goal happens stochastically, so slow, but then sometimes fast.


In the related work section of their paper they mention that WANNs are related to genetic programming [1], a subfield of evolutionary algorithms.

Genetic programming is quite a powerful tool. IIRC, a few years ago, a researcher evolved expressions to model the dynamics of a double pendulum based only on measured data. To his surprise he found that the expressions were the Lagrangian of the system.

[1] https://en.wikipedia.org/wiki/Genetic_programming


but a double pendulum isn't complicated at all!

how many floats are there? 2 lengths, 2 initial angles, a 2-dimensional velocity, a mass, and a gravitational field strength? that's like 8 floats...


I don't agree with your breakthrough. Reverse-mode automatic differentiation is really very simple and is not at all what got us there.

It's finding the right architectures that capture data priors and symmetries, and that can learn efficiently with stochastic gradient descent. Automatic differentiation is just a useful tool to use with those ideas.


I don't dispute the importance of architecture!

But there was nothing new about architecture; essentially it is the choice of multidimensional function one tries to fit. In physics we have been fitting functions for hundreds of years. If you look at some experimental plot of, say, interference, then you might decide to fit a sinusoid to it plus a constant background, etc. Choosing the right kind of function to fit is obviously important, but we didn't know about algorithmic differentiation for hundreds of years (and it surely would have been welcome back then; even if performed by hand, it beats trial-and-error gradients).

That RM automatic differentiation is simple is easy to say in hindsight!

I don't think a richer diversity of functions is a bad idea, but it's already being used: softmax, exponents, sums, squares, ... why not perform gradient descent over a differentiable family of functions that encompasses these?

It's really disingenuous to pretend RM AD was so very simple and then watch approvingly as someone throws it out the window and reverts to... genetic programming? You want to let the computer find the best functions? Fine, but then give the computer a superfunction which, for certain values of an extra parameter, differentiably reaches the functions you want to be considered.

Most of the architectures ... end up looking suspiciously like plain old statistical physics! It's like we repeatedly witness how yet another introductory statistical-physics expression turns out to perform well on very general sets of tasks (it really comes across as if everything should be treated like a dumb mole of water, and we never tried before because we simply refused to believe it could be that simple).


I'm not sure I understand what you're saying. I think you're talking about gradient descent, not RM autograd. Gradient descent, which is a breakthrough, dates back to the 19th century. That's what allows us to optimize by following gradients. RM autograd is a clever implementation detail of this (how to compute gradients efficiently).


Evolutionary algorithms are perhaps easier to approach theoretically. Backprop works amazingly well, but how does one begin to approach why? There is also an element of backprop in evolution, via epigenetics.


Evolutionary algorithms are perhaps easier to approach theoretically

Why?


>Evolutionary algorithms are perhaps easier to approach theoretically.

I would certainly welcome recommendations on theoretical approaches to evolutionary algorithms, both for my own review (there are always new insights to be gained), and to have better-quality pointers when I know from a discussion that the counterparty would benefit greatly from insights into things like the Fisher equations etc.:

From higher to lower preference, by format:

1. Open courseware

2. Books

3. Reviews (in the scientific article sense)

4. Articles (same)

From lower to higher preference, by content:

1. Must include a rigorous modern mathematical treatment of the "modern synthesis"

2. The same as 1), but also including information theory connections.

3. The same as 1), but also including the post-modern synthesis ideas.

4. The same as 1), but also including both 2) and 3)

And from highest to lowest preference, by presentation:

1. Theoretical, with numerical exercises (of equations), with numerical simulations, and theorems and proofs.

2. Theoretical, with numerical simulations.

3. Theoretical.

>Backprop works amazingly but how does one begin to approach why.

In my opinion, we fully understand why backpropagation works, but we are still mystified by the exact meaning of the weights and architecture (although we are slowly starting to understand facets here and there). Another issue is: if a person makes a claim, we can ask why, and the person will explain in human terms why they believe something. Currently the network itself does not understand our question of why, so we concoct mathematical tricks to "ask" the network why, which is not entirely the same thing as having a network reach a conclusion and then explain why interactively. But the backpropagation itself is entirely understood IMHO: it's optimization for a better score.

>There is also an element of backprop in evolution via epigenetics

I only ever use the word epigenetics in arguments against the usage of the concept. As far as I can tell, everyone seems to reference a different concept or anomaly or deviation with epigenetics. It's like talking about "new physics" without specifying what unexplained phenomenon it hypothesises or addresses. Even worse: sometimes it simultaneously proposes a mechanism and a hypothetical unobserved deviation. There is no agreement on what it is that has to be explained, nor on how it is to be explained. How can I even comment on "backpropagation in evolution via epigenetics" then?

[At least in the operation of the brain there is a very clear dichotomy between the observed interactions between neurons in the brain and training a digital neural network on a computer: we perform backpropagation in our algorithms, but to my knowledge we have not unambiguously identified a biological mechanism through which it arises. We know that IF the brain uses reverse-mode automatic differentiation, then it must entail retrograde signalling from the post-synaptic to the pre-synaptic neuron, but such a feedback mechanism has not been positively identified to my knowledge.]

The most common interpretation is this, from wikipedia:

>Epigenetics most often denotes changes that affect gene activity and expression, but can also be used to describe any heritable phenotypic change. Such effects on cellular and physiological phenotypic traits may result from external or environmental factors, or be part of normal development. The standard definition of epigenetics requires these alterations to be heritable,[3][4] in the progeny of either cells or organisms.

To the extent it refers simply to cellular differentiation, why not simply state "cellular differentiation"? Cellular differentiation is fully understood at a conceptual level and does not require a second storage mechanism besides DNA: simple concentration levels suffice. The affinities of binding regions and the chemical reaction constants set up a differential equation that can be simulated (say, numerically by the Gillespie algorithm). The same unique DNA code admits multiple cell types: how? The homeostatic response is multistable, just like 2 identical flip-flops from the same manufacturing line can memorize different states: if the concentration or voltage wanders from a stable equilibrium point, the flip-flop will correct it back to the nearest stable equilibrium point. Pull it over the unstable equilibrium point and it will switch to the other stable equilibrium point.

This may or may not involve histones etc., but those are chemical species like any other in the cell. Even without histones you can have a single feedback mechanism (specified by the genome) that supports multiple stable points (the cell types). Just like the brain does not need a homunculus for its identity, the cell does not need a "cell-type-unculus" to remember its cell type: it just looks at the current cellular concentrations and homeostatically corrects them in a direction and at a rate that is a function of the current cellular concentrations.

It is my interpretation that a lot of epigenetic talk, and histones and methylation, is a kind of search for the unnecessary "cell-type-unculus", stemming from a lack of awareness that a single set of differential equations can imply multiple stable points. Those who are unaware of this would benefit from reading the very accessible book by Steven Strogatz, "Nonlinear Dynamics and Chaos".

For a single cellular organism, this explains heritable information that is not encoded in DNA: after cellular division the daughter cells have roughly identical concentrations as the mother cell did.

For a multi-cellular organism, this explains cell types.

[EDIT: Perhaps a simpler way to express this argument is with a simpler model of gene regulation: instead of continuous concentration levels, pretend they are binary (on or off). A combinational logic circuit has no memory, but a sequential one, where some outputs are fed back as inputs, can have memory. So upon cellular division the state of the daughter cell's boolean concentrations will be nearly identical to the mother cell's concentrations (apart from those concentrations involved in cellular division itself), so they will have the same cell type as the mother cell, A -> A + A (unless it crucially depends on some of the signals involved in the process of cellular division, which allows A -> A + B, or A -> B + C, where A, B, C are distinct cell types); in theory extracellular signals entering the cell could prompt it to change cell type too: A + signal -> B]
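
To make the multistability point concrete, a minimal toy illustration (my own sketch, not a model of any real gene network): the classic two-gene mutual-repression toggle switch, one fixed set of equations with two stable states that different initial conditions settle into. Parameters are arbitrary.

    def settle(x, y, alpha=4.0, n=2, dt=0.01, steps=20000):
        # crude forward-Euler integration of a two-gene mutual-repression switch
        for _ in range(steps):
            dx = alpha / (1 + y ** n) - x
            dy = alpha / (1 + x ** n) - y
            x, y = x + dt * dx, y + dt * dy
        return x, y

    print(settle(3.0, 0.1))   # ends near (high x, low y): "cell type A"
    print(settle(0.1, 3.0))   # same equations, ends near (low x, high y): "cell type B"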

About say methylation as an explanation for heritable traits for multicellular organisms: how does one propose that the methylation state is copied when DNA is duplicated during cell division?

Any time you have the urge to use the word "epigenetics", ask yourself if you perhaps just mean "cellular differentiation". If you can be precise, why not be precise?


Next thing you know, google will be telling us that the fastest websites are server side rendered from templates with minimal JavaScript.

How could anyone possibly have known?



