It deals with the classic, and wonderful, question of "If I go and catch 100 birds, and they're from 20 different species, how many species are left uncaught?" There's more one can say about that than it might first appear and it has plenty of applications. But mostly I just love the name. Apparently PNAS had them change it for the final publication, sadly.
This sounds like a similar problem to an exercise in the famous probability textbook by William Feller about estimating the total number of fish in a lake by catching N of them and tagging them, and then throwing them back in the lake and coming back later to catch another N fish. You check how many of those fish are tagged and derive your estimate from that using the hypergeometric distribution. See the pages here:
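For anyone who doesn't have Feller handy, the back-of-the-envelope version of that exercise is the classic Lincoln-Petersen mark-recapture estimate. A toy sketch with made-up numbers (my own illustration, not from the paper):

    # Toy mark-recapture estimate (Lincoln-Petersen), numbers invented.
    # Tag n1 fish, return them, later catch n2 fish and count m tagged ones.
    n1, n2, m = 100, 100, 20
    # The tagged fraction in the second catch (m/n2) estimates n1/N,
    # so the lake population N is roughly n1 * n2 / m.
    N_hat = n1 * n2 / m
    print(N_hat)    # 500.0 fish in the lake, roughly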
Yeah, exactly. If you wanted to know that your code was bug free, how could you do it? Set a team of experts to each independently scour for bugs. But when do you stop? The quick answer is that you should keep going until every bug you've found has been found at least twice. I think of this as meaning that you "just barely" found a bug if only one person identified it, so there are probably still bugs you have "just barely" not found remaining.
Set a team of experts to find bugs independently for some specified amount of time. Then look at how many of the same bugs were found by multiple experts.
If most of the bugs were found by multiple experts, then there are probably not that many more bugs than the total number that they found. If most of the bugs were found by only one expert, then there are probably a lot more than the total that they found.
With some math you can pin down the 'probablies' to numeric ranges.
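One standard back-of-the-envelope way to pin down the 'probablies' (not necessarily this paper's method, just the classic species-richness estimate) is Chao1, which only needs the counts of bugs found exactly once and exactly twice. A toy sketch:

    from collections import Counter

    # Each entry: how many experts independently found a given bug (toy data).
    times_found = [3, 1, 2, 1, 1, 2, 4, 1]
    counts = Counter(times_found)
    f1, f2 = counts[1], counts[2]     # bugs found exactly once / exactly twice
    observed = len(times_found)
    # Chao1 lower-bound estimate for the total number of bugs:
    total_est = observed + f1 * f1 / (2 * f2)
    print(total_est)                  # 8 observed + 16/4 = 12 estimated in total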
Is that a direct application of the paper or something else? (sorry I didn't read it)
Just wondering because this rule of thumb sounds intuitively wrong to me. Depending on the difficulty of the bugs and the skill levels of the experts, it seems possible for them to find every "easy" bug at least twice while none of them finds the hardest bug even once. (A real-world example would be some obscure zero-day security bug.)
I should have said "at least until" rather than "until", sorry. It's from a result that predates the paper, due to Fisher (using butterflies, not code bugs).
The actual guarantee from the result is not that the number of unobserved "species" is small, but that the total population of all unobserved species is small. If you go back to the birds example, then you could say something like "at most 0.1% of all birds are from species that we haven't identified" but maybe those 0.1% of birds are from a million different species each with incredibly tiny populations. In the code bug example, the very rare species would be the bugs that are very unlikely to be found, i.e. it's more about estimating how many more bugs you will find if you continue to analyze it than how many are really there.
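That distinction is exactly what the Good-Turing "missing mass" estimate captures: the fraction of future observations that will come from unseen species is roughly the fraction of current observations that were singletons. A toy version with made-up counts (my own sketch, not the paper's code):

    # Good-Turing estimate of the unseen probability mass (toy numbers).
    species_counts = {"sparrow": 60, "robin": 30, "wren": 5,
                      "heron": 1, "kite": 1, "shrike": 1, "osprey": 1, "crake": 1}
    n = sum(species_counts.values())                        # 100 birds caught
    singletons = sum(1 for c in species_counts.values() if c == 1)
    missing_mass = singletons / n
    print(missing_mass)   # 0.05: ~5% of birds belong to species we haven't seen yet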
Does that not assume equal probability of finding bugs? Seems to me that both teams would find the same easily identified bugs (say '=' instead of '==' or revealed by compiler warnings).
I've always hated build systems. Stuff cobbled together that barely works, yet a necessary step towards working software. This paper showed me there's hope. If we take build systems seriously, we can come up with something much better than most systems out there.
I've recently started learning C++ and have had to grapple with the complexities of CMake. I'm only a few pages into the paper but it's already done a great job at distilling the problem domain and the core components of a build system.
I also found the beginning to be a great introduction to `make` and the build dependency tree.
Major breakthrough in cell-free diagnostics. The methylation pattern of DNA can be used to identify early-stage cancer, i.e. circulating tumor DNA (ctDNA) has a distinct methylation pattern.
The results are based on data from a ten year study which must have cost a fortune to run.
Combine that with the breakthrough in protein folding and mRNA vaccines and we could have a rapid pipeline for custom, targeted immunotherapies for not just new bugs, but new cancers.
Goes to show the complete lack of agreement between researchers in the explainability space. Most popular packages (AllenNLP, Google LIT, Captum) use saliency-based methods (Integrated Gradients) or attention. The community has fundamental disagreements on whether they capture anything equivalent to importance as humans would understand it.
An entire community of fairness, ethics and Computational social science is built on top of conclusions using these methods. It is a shame that so much money is poured into these fields, but there does not seem to be as strong a thrust to explore the most fundamental questions themselves.
(my 2 cents: I like SHAP and the stuff coming out of Bin Yu's and Tommi Jaakkola's labs better... but my opinion too is based on intuition without any real rigor)
These are deep neural net papers, specifically in NLP.
Explanation: why the model (the deep neural net, that is) is doing what it's doing.
Attention: a particular technique used in certain deep net models, invented a few years ago, that originally showed remarkable performance improvements in NLP (natural language processing) tasks, but has recently been applied in vision and elsewhere.
recently, a lot of neural network models, especially those in NLP (like GPT-3, BERT, etc.) use "attention" which basically is a way for neural networks to focus on certain subset of the input (the neural network can focus its "attention" to a particular part of the input). Explanations just refers to ways for explaining the predictions of the neural networks.
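If it helps to see the mechanics, here is a bare-bones numpy sketch of scaled dot-product attention, the core operation behind it (shapes and names are just illustrative, not any particular library's API):

    import numpy as np

    def attention(Q, K, V):
        # Each query scores every key, so the score matrix is n x n
        # (this is where the quadratic cost in sequence length comes from).
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
        # Each output position is a weighted mix of the values it "attends" to.
        return weights @ V

    n, d = 5, 8                       # toy sequence length and embedding size
    x = np.random.randn(n, d)
    out = attention(x, x, x)          # self-attention: Q, K, V all come from x
    print(out.shape)                  # (5, 8)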
I think one of the most interesting papers I read this year was Hartshorne & Germine, 2015, "When does cognitive functioning peak? The asynchronous rise and fall of different cognitive abilities across the life span": https://doi.org/10.1177/0956797614567339
There are lots of good bits, such as: 'On the practical side, not only is there no age at which humans are performing at peak on all cognitive tasks, there may not be an age at which humans perform at peak on most cognitive tasks. Studies that compare the young or elderly to “normal adults” must carefully select the “normal” population.' (italics in original)
This seems to me to comport with the research suggesting that most or all of the variance in IQ across the life span can be accounted for by controlling for mental processing speed; i.e., you are generally faster when you are younger, but you are not more correct when you are younger.
Apologies for pasting all of this, but the excerpt has always stuck with me. It seems correct. Are there alternative explanations other than mental processing speed and so on? (For example, later in life you're less likely to be in a position to do the same sort of work. But that seems to have been tested by, e.g., the Institute for Advanced Study.)
As far as I can tell, this section might be one of those facts that people try not to think about too much. I don't worry about it, but I end up thinking about it a lot.
There was recently some headline news about an older mathematician that made a significant breakthrough. Other than that one outlier, have there been many important contributions made by people after the age of, say, 45?
--
"I had better say something here about this question of age, since it is particularly important for mathematicians. No mathematician should ever allow himself to forget that mathematics, more than any other art or science, is a young man's game. To take a simple illustration at a comparatively humble level, the average age of election to the Royal Society is lowest in mathematics.
We can naturally find much more striking illustrations. We may consider, for example, the career of a man who was certainly one of the world's three greatest mathematicians. Newton gave up mathematics at fifty, and had lost his enthusiasm long before; he had recognized no doubt by the time that he was forty that his great creative days were over. His greatest ideas of all, fluxions and the law of gravitation, came to him about 1666, when he was twenty-four—'in those days I was in the prime of my age for invention, and minded mathematics and philosophy more than at any time since'. He made big discoveries until he was nearly forty (the 'elliptic orbit' at thirty-seven), but after that he did little but polish and perfect.
Galois died at twenty-one, Abel at twenty-seven, Ramanujan at thirty-three, Riemann at forty. There have been men who have done great work a good deal later; Gauss's great memoir on differential geometry was published when he was fifty (though he had had the fundamental ideas ten years before). I do not know an instance of a major mathematical advance initiated by a man past fifty. If a man of mature age loses interest in and abandons mathematics, the loss is not likely to be very serious either for mathematics or for himself.
On the other hand the gain is no more likely to be substantial; the later records of mathematicians who have left mathematics are not particularly encouraging. Newton made a quite competent Master of the Mint (when he was not quarrelling with anybody). Painlevé was a not very successful Premier of France. Laplace's political career was highly discreditable, but he is hardly a fair instance, since he was dishonest rather than incompetent, and never really 'gave up' mathematics. It is very hard to find an instance of a first-rate mathematician who has abandoned mathematics and attained first-rate distinction in any other field.1 There may have been young men who would have been first-rate mathematicians if they had stuck to mathematics, but I have never heard of a really plausible example. And all this is fully borne out by my own very limited experience. Every young mathematician of real talent whom I have known has been faithful to mathematics, and not from lack of ambition but from abundance of it; they have all recognized that there, if anywhere, lay the road to a life of any distinction.
First, a note: math is one of the specific areas where humans generally peak really quite young. Math (quantitative reasoning/logic) is not the only area of psychometric intelligence testing (e.g., IQ) and not even a majority of it. So, it may be that it really is a bit harder for older mathematicians to make breakthroughs? I don't know. At any rate, citing mathematics as an indicator is probably not ideal, because math ability does indeed generally peak early.
However, as of 2011[1] the mean age of physics Nobel winners at the time of their achievements across the entire period of the award was 37.2 and since 1985 the mean age was 50.3.
According to the same paper, by the year 2000, Nobel-level achievement in physics before age 40 was only 19% of cases. It also appears that awards in chemistry and medicine are similarly increasing in mean age.
Is this dispositive? Certainly not. Maybe the Nobel committee prefers to award old scientists because of some unknown bias?
However, it does indicate that high achievement is both possible and normal in middle age and beyond.
Specifically responding to the increasing average age of Nobel prize winners: this is in part due to the increasing complexity of problems to solve. With our current ways of solving problems, the new problems become harder and harder. The existing human knowledge is also becoming harder and harder to understand, requiring somebody working in a field to spend much longer studying and catching up to the state of the art before being able to make a significant contribution to the field.
This is one of the reasons that I'm personally so excited about (and working on) the potential of spatially immersive media like VR to understand complex concepts. Taking a step back, tools like a graph plot enabled humans to understand complex concepts like differentials and projectile motion at a much younger age. Could a breakthrough with new ways of understanding human knowledge effectively do the same with knowledge that is today considered complex (eg, quantum mechanics)? If such a breakthrough happens, could we bring the average age of significant contribution in subjects like physics back down?
I don't know, but I hope so. :)
Edit. I also remember reading some thoughts, by Michael Nielsen I think, on the increasing age of Nobel-worthy contributions in physics, but I can't find them in my current sleep-deprived state. I shall look tomorrow if somebody else hasn't pointed to that article by then.
Maybe, but I somewhat doubt it. Nobel-level work is usually derivative of only a few basic concepts, but is otherwise quite daring. The academic guild system has become nearly impossible to get through, and all corporate or government research depends on passing through that system first. Nobody can just get to work. First you have to get in. 99% of applicants are mostly concerned with prestige or career opportunities. 1% wants to do research. Then you need funding. This comes by helping professors on their ideas, not yours. Then you need a job. Better pick a popular field and find ‘business value’. Now it’s time to buy a house. Maybe you’ll get back to that big idea you had once you pay it off. Student loan availability turned the academic pipeline into a job requirement, and a job is then required to pay it off.

We’re just entering the era of Nobel winners that started school during the Vietnam boom. It seems to me that the Nobel will cease to have any meaning in a decade or so. The trend is that contemporary prizes are given to politically valued choices from a huge field of contributors, or as an honorarium for ‘famous’ professors. The last 20 years of minted professors are far more focused on job security than great research, and it would be surprising to see a lot of individual breakthroughs at the Nobel level.

And then there are corporate labs, which are run by professional managers handing out nebulous quarterly objectives with a side of panic. Forget about it. The biggest hope for ambitious research may be self-funded entrepreneurs. There must be, somewhere along that path to human colonization of space, a Nobel for Elon.
Good point. I call this problem the Giant's Shoulder Climbing Problem. Isaac Newton said he could only see farther because he was standing on the shoulders of giants. By giants he meant all the knowledge amassed by previous generations. The problem is, nowadays the giants have gotten so big that one can spend the better part of a life just climbing the damn giant, many failing to reach the ever-receding shoulders.
I've pondered this problem a bit before. To solve it, I reached the same conclusion as you: we need some breakthrough in new ways of understanding human knowledge, simplifying knowledge that is today considered complex. In other words, we need to at least try to build some sort of elevator.
IMHO, the most promising possible breakthroughs I could find were:
(I) a reform of math education, with early introduction of schoolchildren to computer algebra system (CAS) software, shifting the curriculum away from tedious manual computation and trick learning. In university, for example, I learned lots of integration tricks and forgot most of them a few years later. Would my time have been better invested just learning SymPy instead of all those tricks? (There's a small SymPy example below, after the quote.) This idea is pushed by Conrad Wolfram; see, for example, his talk at https://youtu.be/jE9lU4E52Vg
(II) a reform of physics education to replace vector algebra with proper geometric algebra, as advocated by David Hestenes. Vector algebra as taught in physics today is actually a hack pushed by Gibbs that only works well in 3D and demands a lot of shoehorning for problems with higher dimensionality. Geometric algebra scales to any number of dimensions, and many problems become easier; the four Maxwell equations, for example, become one. See the discussion in https://physics.stackexchange.com/a/62822 :
"Now, the contention is that Clifford algebra is under-utilized in basic physics. Every problem in rigid-body dynamics is at least as easy when using Clifford algebra as anything else — and most are far easier — which is why you see quaternions being used so frequently. Orbital dynamics (especially eccentricity!) is practically trivial. Relativistic dynamics is simple. Moreover, once you've gotten practice with Clifford algebra in the basics, extending to electrodynamics and the Dirac equation are really simple steps. So I think there's a strong case to be made that this would be a useful approach for undergrads. This could all be done using different tools, of course — that's how most of us learned them. But maybe we could do it better, and more consistently.
No one is claiming that Clifford algebra is fundamentally new; just that it could be bundled into a neater package, making for easier learning. Try teaching a kid who is struggling with the direction of the curl vector that s/he should really be thinking in terms of the algebra generated by the (recently introduced) vector space, subject only to the condition that the product of a vector with itself is equal to the quadratic form. Or a kid who can't understand Euler angles that this rotation is better understood as a transformation generated (under a two-fold covering) by the even subalgebra of Cl3,0(R). No one here is arguing that that should happen. GA is just a name for a pedagogical approach that makes these lessons a whole lot easier than they would be if you sent the student off to read Bourbaki. Starting off with GA may be slightly harder at the beginning, but pays enormous dividends once you get to harder problems. And once teachers and textbooks get good at explaining GA, even the introduction will be easier."
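Coming back to point (I), here's the kind of thing I mean with SymPy: the integration-by-parts exercise I drilled in university is a one-liner. A toy illustration only, not a claim about what the curriculum should look like:

    from sympy import symbols, sin, integrate, diff

    x = symbols('x')
    antiderivative = integrate(x * sin(x), x)   # classic integration-by-parts exercise
    print(antiderivative)                       # -x*cos(x) + sin(x)
    print(diff(antiderivative, x).simplify())   # x*sin(x) -- checks out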
>There was recently some headline news about an older mathematician that made a significant breakthrough. Other than that one outlier, have there been many important contributions made by people after the age of, say, 45?
There is one other effect beyond ability: people over 45 rarely take up new interests. There are plenty of cases of people who continue working in the same field through their 40s and 50s and continue to make advances, but there are many fewer cases of an individual beginning to work in a field after 40 and going on to make a major discovery. In pure math, possibly the most aesthetic-driven technical field (it is impractical by definition), this effect might be especially strong.
Richard Hamilton (not to be confused with William Rowan Hamilton of action-principle fame) initiated the application of the Ricci flow to the geometrization conjecture in 1982 at 39 and continued his work through the '90s, being credited by Perelman as making crucial contributions to the final solution of the Poincare conjecture in three dimensions. He's probably the most prominent example.
>It is very hard to find an instance of a first-rate mathematician who has abandoned mathematics and attained first-rate distinction in any other field.
For me, it was "Erasure Coding in Windows Azure Storage" from Microsoft Research (2016) [0]
The idea that you can achieve the same practical effect as a 3x replication factor in a distributed system while only increasing the cost of data storage by 1.6x, by leveraging some clever information theory tricks, is mind-bending to me.
If you're operating a large Ceph cluster, or you're Google/Amazon/Microsoft and you're running GCS/S3/ABS, if you needed 50PB HDDs before, you only need 27PB now (if implementing this).
The cost savings, and environmental impact reduction that this allows for are truly enormous, I'm surprised how little attention this paper has gotten in the wild.
The primary reason why you should be using 3x or higher replication is read throughput (which makes it only really relevant for magnetic storage). If the data is stored at 1.6x overhead then there are only 1.6 magnetic disk heads per file byte. If you replicate it 6x then there are 6 magnetic disk heads for each byte. At ~15x it becomes cheaper to store on SSD with ~1.5x Reed-Solomon/erasure-code overhead, since SSD has ~10x the per-byte cost of HDD.
(there are also effects on the tail latency of both read and write, because in a replicated encoding you are less likely to be affected by a single slow drive).
(also, for insane performance which is sometimes needed you can mlock() things into RAM; the per-byte cost of RAM is ~100x the cost of HDD and ~10x the cost of SSD).
Everything you just said is on point, but I think that's an orthogonal thing to what the paper is going for. Hot data should absolutely have a fully-materialized copy at the node where operations are made, and an arbitrary number of readable copies can be materialized for added performance in systems that don't rely on strong consistency as much.
However, for cold data there really hasn't been (or at least I am unaware of) any system that can achieve the combined durability of 1.5x Reed-Solomon codes + 3x replication with such a small penalty to storage costs.
Like you said though, it's definitely not the thing you'd be doing for things that prioritize performance as aggressively as the use-cases you've suggested.
~1.5x reed solomon is the default these days, again, unless you need read throughput performance. It is awesome :)
Also, these days the storage of the data doesn't have to be at the same machine that processes the data. A lot of datacenter setups have basically zero transfer cost (or, alternatively, all the within-DC transfer cost is in the CAPEX required to build the DC in the first place), ultra low latency, and essentially unlimited bandwidth for any within-datacenter communication. This doesn't hold for dc1->dc2 communication, in particular it is very very far from the truth in long distance lines.
One way to think about the above is that datacenters have become the new supercomputers of the IBM era - it's free and really fast to exchange data within a single DC.
Also2, this is completely independent of consistency guarantees. At best it relates to durability guarantees, but I want those from all storage solutions. And yes, properly done Reed-Solomon has the same durability guarantees as a plain old replicated setup.
Also, to the above also2, single-DC solutions are never really durable, as the DC can simply burn down or meet some other tragic end; you need geographic replication if your data cannot be accidentally lost without serious consequences (a lot of data actually can be lost, in particular if it is some kind of intermediate data that can be regenerated from the "source" with some engineering effort). This is not just a theoretical concern: I've seen "acts of God" destroy single-DC setups' data, at least partially. It is pretty rare, though.
I'm confused, as you don't seem to be replying to any point I've made...
> ~1.5x reed solomon is the default these days, again, unless you need read throughput performance
I'm not surprised that Reed-Solomon is the "default these days" given that it has existed since the 1960s, and that the most widely available and deployed open-source distributed filesystem is HDFS (which uses Reed-Solomon). However I don't see how that is to be taken as a blind endorsement for it, especially given that the paper in reference explicitly compares itself to Reed-Solomon based systems, including concerns regarding reconstruction costs, performance, and reliability.
> Also, these days the storage of the data doesn't have to be at the same machine that processes the data
Even though what you said here is correct, I don't see how that's relevant to the referenced paper, nor do I think I implied that I hold a contrary belief in any way from what I said.
> Also2, this is completely independent of consistency guarantees
My comment about consistency referred only to the fact that you cannot "simply" spin up more replicas to increase read throughput, because consistent reads often have to acquire a lock on systems that enforce stronger consistency. So your comments regarding throughput are not universally true: there are many systems where reads cannot be made faster this way, as they are bottlenecked by design.
> Properly done Reed-Solomon has the same durability guarantees as plain old replicated setup
This is not true unless the fragments themselves are being replicated across failure domains, which you seem to address with your next comment with "you need geographic replication if your data cannot be accidentally lost without serious consequences". All of this, however, is directly addressed in the paper as well:
> The advantage of erasure coding over simple replication is that it can achieve much higher reliability with the same storage, or it requires much lower storage for the same reliability. The existing systems, however, do not explore alternative erasure coding designs other than Reed-Solomon codes. In this work, we show that, under the same reliability requirement, LRC allows a much more efficient cost and performance tradeoff than Reed-Solomon.
It's not even the reduction in storage costs in this paper that is groundbreaking. They talk about a way to not only reduce storage costs but also optimize for repairs. Repairs are costly at scale, and reducing resource usage where possible (network, CPU, disk reads, etc.) is ideal.
and,
[2] HeART: improving storage efficiency by exploiting disk-reliability heterogeneity - https://www.usenix.org/conference/fast19/presentation/kadeko... . This paper talks about how just one erasure code is not enough and how, by employing code conversions based on disk reliability, we can get up to 30% savings!
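On the repair-cost point above: here's a toy XOR version of the "local parity" idea (my own illustration, not the paper's actual LRC construction, which uses proper coding coefficients). The point is just that if you keep a parity per small local group, a single lost block can be rebuilt by reading only that group instead of the whole stripe:

    from functools import reduce

    def xor(blocks):
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

    # Six data blocks split into two local groups, each with its own XOR parity.
    data = [bytes([i] * 4) for i in range(6)]
    group_a, group_b = data[:3], data[3:]
    parity_a, parity_b = xor(group_a), xor(group_b)

    # Lose data[1]: rebuild it from its local group only (2 data reads + 1 parity read),
    # instead of reading the whole 6-block stripe as a global-only code would require.
    rebuilt = xor([group_a[0], group_a[2], parity_a])
    assert rebuilt == data[1]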
The Google File System (GFS) paper from 2003 mentions erasure codes. Which isn't to say they did it then, but rather that the technique of using erasure coding was known back then.
(And surely before GFS too, I just picked it as an example of a large data storage system that used replication and a direct predecessor to the systems you mentioned.)
CDs (remember those? lol) also implemented Reed-Solomon erasure codes for the stored data, erasure codes in storage systems aren't new at all, and that's not what this paper is about.
I actually found out about this paper because it was referenced in a slide presentation from Google about Colossus (which is the successor to GFS). GFS indeed uses erasure coding with a 1.5x factor, but erasure coding alone does not guarantee durability, and thus needs to be combined with replication to satisfy that requirement, and erasure coding is not the same thing as replication.
The innovation here is explicitly the combination of a new erasure coding algorithm (LRC) AND replication, with a combined storage amplification that is much lower than the previous SOTA.
The paper explicitly compares the new algorithm (LRC) with GFS and other alternatives, and explains why it's better, so this is really not something that is comparable to the 2003 GFS paper in any way (or to any other prior art really), as this is not just a trivial application of erasure coding in a storage system.
There's also this paper [0] from 2001 which digs a bit deeper into the Erasure Codes vs Replication idea that I can recommend if you're interested
I think for the major players you mentioned the 2016 paper was retrospective. Everyone was already doing it. Even mid-tier players like Dropbox Magic Pocket were using erasure coding by 2016, and their scheme was mostly written by ex-Google engineers influenced by Colossus.
Oh I am absolutely aware that erasure codes are an old thing, Reed-Solomon codes have existed since the 1960s, but this is not simply a trivial application of erasure coding to a storage system: erasure codes alone don't provide the same durability guarantees that replication does. [0]
This is a combination of erasure coding AND replication, whose combined storage amplification is dramatically lower than previous SOTA.
I gave a longer explanation in a sibling comment to yours [1]
Thanks for the clarification. I still think these techniques were somewhat widespread already ... see for example this figure from US Patent 9292389 describing nested, aka layered coding that to my thinking is isomorphic with the "LRC" or "pyramid code" described by the Microsoft researchers.
By the way, not at all trying to say this paper isn't interesting. I keep it in my filing cabinet to show my colleagues when I need to describe this technique, since Google hasn't ever bothered to describe Colossus in a way I can reference.
Row types are magically good: they serve either records or variants (aka sum types aka enums) equally well and both polymorphically. They’re duals. Here’s a diagram.
                 Construction                    Inspection
    Records      {x:1} : {x:Int}                 r.x — r : {x:Int|r}
                 [closed]                        [open; note the row variable r]
    Variants     ‘Just 1 : <Just Int|v>          case v of ‘Just 0 -> ...
                 [open; note the row var v]      v : <Just Int>
                                                 [closed]
Neither have to be declared ahead of time, making them a perfect fit in the balance between play and serious work on my programming language.
I love polymorphic records/variants. Variants particularly are amazing for error propagation. Records of course are useful in many of the same places tuples and structs are. The main reluctance I have is whether to allow duplicate entries in records. If you allow them, many things become much easier, but they make records inherently ordered when they weren't previously.
It's from 2017 but I first read it this year. This is the paper that defined the "transformer" architecture for deep neural nets. Over the past few years, transformers have become a more and more common architecture, most notably with GPT-3 but also in other domains besides text generation. The fundamental principle behind the transformer is that it can detect patterns among an O(n) input size without requiring an O(n^2) size neural net.
If you are interested in GPT-3 and want to read something beyond the GPT-3 paper itself, I think this is the best paper to read to get an understanding of this transformer architecture.
“it can detect patterns among an O(n) input size without requiring an O(n^2) size neural net”
This might be misleading: the amount of computation for processing a sequence of size N with a vanilla transformer is still N^2. There has been recent work, however, which has tried to make them scale better.
You raise an important point. The proposed solutions are too many to enumerate, but if I had to pick just one currently I would go for "Rethinking Attention with Performers" [1]. The research into making transformer better for higher dimensional inputs is also moving fast and is worth following.
It's clearly important but I found that paper hard to follow. The discussion in AIMA 4th edition was clearer. (Is there an even better explanation somewhere?)
It's crazy to me to see what still feel like new developments (come on, it was just 2017!) making their way into mainstream general purpose undergraduate textbooks like AIMA. Is this what getting old feels like? :-\
I start to understand what you always hear from older ICs about having to work to keep up, or else every undergrad coming out will know things you don't.
Three papers stick out for me in the IML / participatory machine learning space this year:
1) Michael, C. J., Acklin, D., & Scheuerman, J. (2020). On interactive machine learning and the potential of cognitive feedback. ArXiv:2003.10365 [Cs]. http://arxiv.org/abs/2003.10365
2) Denton, E., Hanna, A., Amironesei, R., Smart, A., Nicole, H., & Scheuerman, M. K. (2020). Bringing the people back in: Contesting benchmark machine learning datasets. ArXiv:2007.07399 [Cs]. http://arxiv.org/abs/2007.07399
3) Jo, E. S., & Gebru, T. (2020). Lessons from archives: Strategies for collecting sociocultural data in machine learning. Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 306–316. https://doi.org/10.1145/3351095.3372829
Also a great read related to IML tooling for audio recognition:
1) Ishibashi, T., Nakao, Y., & Sugano, Y. (2020). Investigating audio data visualization for interactive sound recognition. Proceedings of the 25th International Conference on Intelligent User Interfaces, 67–77. https://doi.org/10.1145/3377325.3377483
Also, what do you mean by "participatory" in the context of machine learning? Is there a seminal paper that defines it?
I ask as in HCI and other fields, participatory had a VERY defined meaning that in short, I'd about equal power, democracy, and inclusivity. I can't understand how that applies to ML and would like to learn more, hence asking you.
I think "participatory" means something similar here within an ML context. It favors building community-based algorithmic systems and focuses on lowering the barrier to participation, so that non-expert users can be involved during the machine learning development cycle.
I'm not aware of any seminal papers per se, although here are a few that I've read recently... first one is something I maintain at $DAYJOB:
1) Halfaker, A., & Geiger, R. S. (2020). Ores: Lowering barriers with participatory machine learning in Wikipedia. ArXiv:1909.05189 [Cs]. http://arxiv.org/abs/1909.05189
2) Martin Jr. , D., Prabhakaran, V., Kuhlberg, J., Smart, A., & Isaac, W. S. (2020). Participatory problem formulation for fairer machine learning through community based system dynamics. ArXiv:2005.07572 [Cs, Stat]. http://arxiv.org/abs/2005.07572
A lot of IML work seems to focus on building interfaces, so this one was pretty good:
1) Dudley, J. J., & Kristensson, P. O. (2018). A review of user interface design for interactive machine learning. ACM Transactions on Interactive Intelligent Systems, 8(2), 1–37. https://doi.org/10.1145/3185517
You might think that it's possible to use machine learning to predict whether people will be successful using established socio-demographic, psychological, and educational metrics. It turns out that it's very hard and simple regression models outperform the fanciest machine learning ideas for this problem.
The way this study was done is also interesting and paves the way for new kinds of collaborative scientific projects that take on big questions. It draws on communities like Kaggle, but applies it to scientific questions not just pure prediction problems.
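For anyone curious what "simple regression vs. fancy ML" looks like mechanically, here's a minimal sketch of the kind of benchmark, assuming scikit-learn and synthetic data (so it doesn't reproduce the paper's result, just the methodology):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-in for the survey data; the real challenge used
    # thousands of noisy socio-demographic features.
    X, y = make_regression(n_samples=500, n_features=50, noise=50.0, random_state=0)

    for model in (LinearRegression(), GradientBoostingRegressor(random_state=0)):
        r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
        print(type(model).__name__, round(r2, 3))

(On this synthetic, mostly linear data the plain regression will of course come out ahead; the surprising part of the paper is that the same held on the real-world data.)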
> simple regression models outperform the fanciest machine learning ideas for this problem
This reminds me of a classic paper: "Improper linear models are those in which the weights of the predictor variables are obtained by some nonoptimal method; for example, they may be obtained on the basis of intuition, derived from simulating a clinical judge's predictions, or set to be equal. This article presents evidence that even such improper linear models are superior to clinical intuition when predicting a numerical criterion from numerical predictors."

Dawes, R. M. (1979). The robust beauty of improper linear models in decision making. American Psychologist, 34(7), 571.
Genetics and net worth can be blown away by a good or bad group of friends. And unfortunately you start having friends before you are conscious enough to realize the impact.
One of my favorites is definitely A Unified Framework for Dopamine Signals across Timescales (https://doi.org/10.1016/j.cell.2020.11.013), simply because of its experimental design. They 'teleported' rats in VR to see how their dopamine neurons responded, to determine whether TD learning explains dopamine signals on both short and long timescales. Short answer: it does.
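For anyone unfamiliar, the TD-learning picture being tested is simple to state in code: the dopamine signal is modeled as the TD error, i.e. the difference between what you just got (reward plus discounted future expectation) and what you expected. A minimal textbook TD(0) sketch, not the paper's model:

    # Tabular TD(0): the update's error term is the putative "dopamine" signal.
    alpha, gamma = 0.1, 0.9
    V = [0.0] * 5                      # value estimate for 5 states along a track

    def td_step(s, reward, s_next):
        td_error = reward + gamma * V[s_next] - V[s]   # reward prediction error
        V[s] += alpha * td_error
        return td_error

    # Repeatedly run to the end of the track, reward only at the last transition.
    for episode in range(200):
        for s in range(4):
            td_step(s, reward=1.0 if s == 3 else 0.0, s_next=s + 1)

    print([round(v, 2) for v in V])    # value ramps up toward the rewarded state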
On the Measure of Intelligence, François Chollet [1]
Fellow HNers seem to have liked a lot of ML papers; this is not breaking the trend.
This is a great meta paper questioning the goal of the field itself, and proposing ways to formally evaluate intelligence in a computational sense. Chollet is even ambitious enough to propose a proof of concept benchmark! [2] I also like some out of the box methods people tried to get closer to a solution, like this one combining cellular automata and ML [3]
Even if not implemented in such a sophisticated manner, "meaningful availability" is a better metric than pure uptime/downtime for most websites.
At one startup we worked at, we had availability problems for some time, with the service going down in a semi-predictable manner ~2 times a day (and the proper bugfix a few weeks away). Because one of the daily outages happened in the middle of the night with no one on call, pure availability was 80-90%. Given that it was a single-country app with no one trying to do any business during the night, meaningful availability was ~99%. Knowing that gave us peace of mind and made tackling the problem much more relaxed than the few weeks of crunch time I've seen at other companies in similar situations.
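A toy version of that arithmetic (numbers invented, and a simplified take on the paper's "user-uptime" idea): weight each outage by how much user traffic it actually hit, rather than by wall-clock time.

    # Two outages a day: a long one at 3am with almost no traffic, and a short
    # blip at 2pm that drops a few hundred requests (all numbers made up).
    total_requests = 70_000           # a day's traffic, almost all during business hours
    failed_night = 80                 # the hours-long night outage hits barely anyone
    failed_day = 600                  # the short daytime blip hits real users

    downtime_hours = 4.5              # ~4h at night + ~30min during the day
    time_availability = 1 - downtime_hours / 24
    meaningful_availability = 1 - (failed_night + failed_day) / total_requests

    print(round(time_availability, 3))        # ~0.81 -- looks terrible
    print(round(meaningful_availability, 4))  # ~0.99 -- what users actually experienced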
Keeping CALM: When Distributed Consistency Is Easy
In computing theory, when do you actually need coordination to get consistency? They partition the space into two kinds of algorithm, and show that only one kind needs coordination.
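A hand-wavy illustration of the dividing line (my own toy example, not the paper's formalism): monotone logic like "union everything you've heard about" gives the same answer no matter the delivery order, so no coordination is needed; a non-monotone question like "is X still missing?" can flip as more messages arrive, which is where coordination comes in.

    from itertools import permutations

    messages = [{"a", "b"}, {"c"}, {"b", "d"}]

    # Monotone: set union is order-insensitive, so every delivery order agrees.
    results = {frozenset(set().union(*order)) for order in permutations(messages)}
    assert len(results) == 1           # confluent: no coordination needed

    # Non-monotone: "have we NOT seen 'd' yet?" changes its answer mid-stream,
    # so without coordination different replicas can answer differently.
    seen = set()
    answers = []
    for m in messages:
        seen |= m
        answers.append("d" not in seen)
    print(answers)                     # [True, True, False] -- a retracted answer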
Not a paper, but a fantastic talk by Samy Bengio "Towards better understanding generalisation in deep learning" at ASRU2019.
Some pretty mind-blowing insights - e.g.: if you replace one layer's weights in a trained classification network with that layer's initialisation weights (or some intermediate checkpoint), many networks show relatively unaffected performance for certain layers... which is seen as a form of generalisation since it amounts to parameter reduction. However, if you replace them with fresh random weights (even though the initialisation state is itself just another set of random weights), the loss is high! Some layers are more sensitive to this than others in different network architectures.
I recently summarised this to a friend who asked "what's the most important insight in deep learning?" - to which I said - "in a sufficiently high dimensional parameter space, there is always a direction in which you can move to reduce loss". I'm eager to hear other answers to that question here.
Nice. Love the Bengio Bros. Yoshua especially was right there with Geoffrey Hinton, Yann LeCun, Andrew Ng as the earliest pioneers of successful deep learning, for over 20 years. (While most technologists were crazy about this thing called the World Wide Web in the late 1990's, these guys were shaping brain-inspired AI algorithms and representations.)
Anyways, one of the papers by Yoshua that was really influential on my master's thesis was published in 2009, has received 8956 citations to date on Google Scholar, called "Learning Deep Architectures for AI". For many young researchers, even though this paper pre-dates much of the hyped architectures of the current era, I would still recommend it for its timeless views on deep representations as equivalent to architectural units of learning and knowledge, including its breakdown of deep networks as compositions of functions.
That's not surprising to me: if we view the weights in the whole network as "individual cells" in a population, and if we pretend that at each update of network weights, before the update, each weight undergoes cell division such that one daughter cell / weight is an increase in weight and the other a decrease, then each component of the gradient descent vector can be viewed as the fitness function for that specific cell or weight: an increase or a decrease. From this perspective each cell forms its own niche in the ecosystem, and it's no surprise that replacing a cell with its ancestor is roughly compatible with the final network cells: the symbiosis goes both ways.
The reason for Bengio demonstrating this on a complete layer is obviously to demonstrate that this is NOT due to redundantly routing information to the next layer (think holographically, for robustness). And using non-ancestor random weights illustrates that the ecosystem fitness suffers if redundant / holographic routing is prevented while also using non ancestral cells / weights...
Unfortunately no .. and my searches haven't been productive. I was referred to this talk by my professor who attended the conference and I got to see only the slide deck as well. .. but the slide deck is very good and easy to follow.
I'm not sure if this is precisely the direction things should go in order to improve the utilisation of specification within software development, but it's a very important contribution. As yet my favourite development style has been with F-star, but F-star also leaves me a bit in the lurch when the automatic system isn't able to find the answer. Too much hinting in the case of hard proofs.
Eventually there will be a system that lets you turn the crank up on specification late in the game, allows lots of the assertions to be discharged automatically, and then finally saddles you with the remaining proof obligations in a powerful proof assistant.
Chen, Y.W.; Yiu, C.B.; Wong, K.Y. Prediction of the SARS-CoV-2 (2019-nCoV) 3C-like protease (3CL pro) structure: Virtual screening reveals velpatasvir, ledipasvir, and other drug repurposing candidates. F1000Research 2020, 9, 129.
This paper (based on a machine learning-driven open source drug docking tool from Scripps Institute) from Feb/Mar formed the basis for the agriceutical venture I started for supporting pandemic management in Africa. We’re in late stage trialing talks with research institutes here in East Africa.
One that came across my desk this year was the Archaic Ghost Introgression paper (https://advances.sciencemag.org/content/6/7/eaax5097), which established genetic contribution from an unknown archaic species in modern West African populations. It's notable not only because of the cool findings, but also because the paper is a culmination of a whole number of broader advances.
I have to admit to skim reading, but: Finding and Understanding Bugs in C Compilers, by Yang, Chen, Eide, and Regehr, 2011. (Yes, it's from 9 years ago.) It's an interesting and approachable read if you're into programming languages and compilers.
Abstraction has made Programming a hybrid of Tradition/Authority/Science/Art.
It's nearly impossible to have a scientific paper on anything with abstraction. At best you can create some "after the fact" optimizations using time studies and statistics.
I don't mean programming literally (although there are plenty of papers on programming abstractions, especially in the functional world) but just computer subjects in general whether that is cryptography, complexity theory, hardware or something else.
Discovering Symbolic Models from Deep Learning with Inductive Biases [1] trains graph neural nets on astrophysical phenomena and then performs symbolic regression to generate algebraic formulae to elegantly model the phenomena in a classical physics framework. It's largely gone under the radar but has pretty interesting implications for NLP and language theory in my opinion.
Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures [2] applies DFA, an approach to training neural nets without backprop, to modern architectures like the Transformer. It does surprisingly well and is a step in the right direction for biologically plausible neural nets as well as potentially significant efficiency gains.
Hopfield Networks is All You Need [3] analyzes the Transformer architecture as the classical Hopfield Network. This one got a lot of buzz on HN so I won't talk about it too much, but it's part of a slew of other analyses of the Transformer that basically show how generalizable the attention mechanism is. It also sorta confirms many researchers' inkling that Transformers are likely just memorizing patterns in their training corpus.
Edit: Adding a few interesting older NLP papers that I came across this year.
StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding [4]
Do Syntax Trees Help Pre-trained Transformers Extract Information? [5]
Learning to Compose Neural Networks for Question Answering [6]
I yeaaaarn for a future where we will have a general parallelizable training method for neural networks (or even better a principled way to initialize trained weights like the work being done on wavelet scattering). Long training times with backpropagation is a serious obstacle when doing experiments.
I had hoped DFA would be it but it doesn't work for image tasks sadly.
I only started reading papers this year and have only read two: An empirical study of Rust (found this on 4chan of all places), and a study of local-first software (found right here on HN). The latter truly got me thinking about the cloud services I was using and how your work really isn't yours if it's stored faraway on some cloud servers and not on your local machine. The introduction to Conflict-free Replicated Data Types (CRDTs) was excellent as well.
Almost 2020: MuZero from DeepMind was a pretty amazing breakthrough. A single algorithm that can play Atari games, chess, and Go (and a variety of other board games) with superhuman ability.
Aren't you one of the authors of that paper? It was a good read. Have a look at https://iko.ai, it solves not only many of the pain points you wrote about, but the "critical" pain points (difficult and important).
- No-setup collaborative notebooks: near real-time editing on the same notebook. Large Docker images.
- Long-running notebook scheduling: you can schedule a notebook right from the JupyterLab interface and continue to work on your notebook without context switch. The notebook will run and you can view its state even if you close your browser or get disconnected, and you can view it without opening the JupyterLab interface, even on your mobile phone. https://iko.ai/docs/notebook/#long-running-notebooks
- Automatic experiment tracking: iko.ai automatically detects parameters, metrics, and models and saves them. Users don't have to remember or know how to write boilerplate code or experiment tracking code.
- One click parametrization: you can publish an AppBook to enable other people to run your automatically parametrized notebook without being overwhelmed by the interface. You don't have to use cell tags or metadata to specify parameters. You click a button and an application is created from your notebook. The runs of this application are also logged as experiments in case they generate a better model. https://iko.ai/docs/appbook/
- Easy deployment: people can look at the leaderboard, and click on a button to deploy the model they choose into a "REST API" endpoint. They'll be able to invoke it with a POST request or from a form where they simply upload or enter data and get predictions.
We haven't focused on the stylesheets given that in our previous projects with actual paying enterprise customers, it wasn't CSS that held us back.
I haven't exactly read many this year, but I really liked "An Answer to the Bose-Nelson Sorting Problem for 11 and 12 Channels" [1]. It describes many interesting algorithmic tricks to establish a lower bound for an easy to understand problem. Not exactly immediately practical, but still very interesting.
Note that it has been published on arXiv just yesterday; I helped review an earlier draft.
Erik Hoel in this paper offers an audacious hypothesis: our brain, during its evolution, developed dreams as a way to combat overfitting.
Since we’re learning from a limited samples of data in the real world, chances of overfitting (I call it judgement) goes higher. In ML we inject randomness and noise to avoid overfitting. Hoel theory can explain why our dreams are so sparse & hallucinatory
Great read. Note if you're not going to read it that you yourself should not eat 35 eggs per day because these patients had calorie requirements of a little under 7000.
It explains the basic concepts of fairness in ML, with a very practical example in my domain that shows the trade-off between the fairness of an algorithm and overall performance (money). It really makes you see what may go wrong with bias in ML. It shows, in my opinion, why we will have to regulate ML, as corporations aren't really incentivized to deal with fairness. It also shows that there are different notions of fairness, so there will always be something that feels unfair, and doing something can always be interpreted as positive discrimination.
Deep brain optogenetics without intracranial surgery

"Achieving temporally precise, noninvasive control over specific neural cell types in the deep brain would advance the study of nervous system function. Here we use the potent channelrhodopsin ChRmine to achieve transcranial photoactivation of defined neural circuits, including midbrain and brainstem structures, at unprecedented depths of up to 7 mm with millisecond precision. Using systemic viral delivery of ChRmine, we demonstrate behavioral modulation without surgery, enabling implant-free deep brain optogenetics."
Even if a bit impractical in some regards, I think an operating system/cloud that you interact with like a database is something we should aspirationally strive for. We're spending too much time gluing things together and not enough time being productive. Databases are great at tracking and describing resources (much better than YAML), and stored procedures that are like Lambdas would be neat.
DreamCoder: Growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning
https://arxiv.org/abs/2006.08381
A killer paper presenting an algorithm capable of inductive learning. ("DreamCoder solves both classic inductive programming tasks and creative tasks such as drawing pictures and building scenes. It rediscovers the basics of modern functional programming, vector algebra and classical physics, including Newton's and Coulomb's laws.")
This year is the first year I actually started reading papers about computer graphics. This old paper by Ken Perlin from 1985, "An Image Synthesizer", inspired me a lot. It really showed me that if you have a deep understanding of some basic principles, like the sine function, you can create beautiful things.
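In that spirit, here's a tiny toy (nothing like Perlin's actual noise function, just a few sine waves summed together) that already produces a surprisingly organic-looking ASCII pattern:

    import math

    # A "plasma"-style pattern from nothing but sines (toy, not Perlin noise).
    for y in range(20):
        row = ""
        for x in range(60):
            v = (math.sin(x * 0.15)
                 + math.sin((x * 0.5 + y) * 0.1)
                 + math.sin(math.hypot(x - 30, y - 10) * 0.3))
            row += " .:-=+*#"[int((v + 3) / 6 * 7.999)]   # map [-3, 3] to 8 shades
        print(row)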
I only properly read the Lottery Ticket Hypothesis paper[1] properly at the start of 2020.
I think it's going to be years before we understand this properly, but in 2020 we are beginning to see practical uses.
At the moment I think it's a toss-up: either it's going to be a curiosity that people read, have their minds blown by, but can't do anything with, or else it has a good chance of being the most influential deep learning paper of the decade.
This one is from 2008. There's this method from statistics called PCA that lets you reduce high-dimensional data into a few (usually meaningless) newly-fabricated dimensions. It's useful for visualizing complex data in 2D space.
In this paper, they did that with genes. And the 2D space that was left wasn't meaningless at all: it accurately recreated the map of Europe.
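If you haven't played with PCA, the mechanics are a few lines with scikit-learn. Toy random data here, not the genetic dataset, just to show what "reduce to 2D" means:

    import numpy as np
    from sklearn.decomposition import PCA

    # Pretend each row is a person and each column a gene variant (toy data).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 500))

    coords = PCA(n_components=2).fit_transform(X)    # 500 dims -> 2 dims
    print(coords.shape)                              # (200, 2): one (x, y) per person
    # In the paper, plotting these two coordinates essentially redrew the map of Europe.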
I'm not an expert and only started reading papers this year, but this is where I got a few of the papers that I read from:
- 2minutepapers on youtube - mostly ML and CG
- Google scholar for finding specific things.
- The shadertoy discord channel. People reference to CG papers a lot there.
The last thing is that I save every paper that I see mentioned as a PDF file in a separate folder on my PC. I use a combination of ripgrep-all + recoll to search them.
I've been reading up on the object capability security model a lot recently, and was pointed to this paper... I was hooked. A really compelling security model almost from first principles.
We administered 100 mg MDMA or placebo to 20 male participants in a double-blind, placebo-controlled, crossover study.
Cooperation with trustworthy, but not untrustworthy, opponents was enhanced following MDMA but not placebo.
Specifically, MDMA enhanced recovery from, but not the impact of, breaches in cooperation.
During trial outcome, MDMA increased activation of four clusters incorporating the precentral and supramarginal gyri, superior temporal cortex, central operculum/posterior insula, and supplementary motor area.
MDMA increased cooperative behavior when playing trustworthy opponents.
Our findings highlight the context-specific nature of MDMA's effect on social decision-making.
While breaches of trustworthy behavior have a similar impact following administration of MDMA compared with placebo, MDMA facilitates a greater recovery from these breaches of trust.
'What if we treated AI as equals, like other human beings, not as tools or, worse, slaves to their creators?' That's the premise of this paper, which is a wonderful provocation. It's a really important consideration too, when you consider how many of our decisions we're asking machine sentience to make for us. If algorithmic bias were a human judge, they'd be thrown out of court (you'd hope).
This study is extraordinary in terms of the extent to which it reveals just how little we fully understand about what is actually taking place in the genetic process. Quote from the end of the paper's introduction section: "In particular, the demonstration of highly efficient splicing in mammals in the absence of transcriptional pausing causes us to rethink key features of splicing regulation."
https://www.biorxiv.org/content/10.1101/2020.02.11.944595v1....
Tactical Periodization is a training program credited for the success of at least one star. This paper says, in short, there are no scientific studies proving it works. We'd give it the benefit of the doubt and wait for proof, but it's been 20 years and still nothing.
Hey, let's get some interesting humanities papers into the mix, since thanks to COVID I had a lot of extra reading time and "best" is purely subjective:
When James C. Scott wrote about infrapolitics in his 1990 work "Domination and the Arts of Resistance: Hidden Transcripts" (https://www.jstor.org/stable/j.ctt1np6zz) and described it as a sort of political resistance that never declares itself and remains beneath what the dominant group can properly perceive until the power shift actually starts to happen, he probably didn't think of a case where the undeclared politics is so literally something not meant to be seen. The theme is very much a progression of how slowly women were able to establish even which parts of their body can be sexualized or not sexualized, and it culminates in a sudden burst, or power shift, in the 1910s-1930s after centuries of aggregating individual choices and entirely unseen acts. This particular revolution managed to happen almost entirely outside of organization and public view, and while it's by no means over, the progress made in the last twenty years covered in the paper really shows how the aggregation of individual acts of resistance, done without any open plans, can still bring about so much change. It also showed the limits of such movements, particularly when the dominant group has an active interest in preserving that status quo.
These two were both written by UCLA Law professor Joanna Schwartz over the course of about a year and a half from 2017-2018, and they really got a lot of attention this year when a lot of people asked for the first time, "why does it seem impossible to actually hold abusive police to some degree of personal responsibility?" Having worked at a public defender's office and then on federal CJA cases (essentially federal defense work when there is more than one codefendant and the federal defenders would have a conflict of interest defending both), the abusive nature of policing was very much something that I saw constantly for years, but it's difficult to quantify just how little potential consequence a police officer may actually face, because nobody had done the shoeleather work to collect the data, and police departments tend to have opacity written into their contracts. The data Schwartz actually collected demonstrates how multiple layers of shielding get negotiated into police contracts, and just how much indemnification, which is actually illegal in many jurisdictions but universally ignored, pushes any potential liability onto taxpayers directly, creating a situation where victims' taxes are just getting looped back into the settlements they receive. There are a lot of problems in the criminal justice system, and really in any carceral system this country runs, and most of them are poorly documented on a systemic level and difficult to quantify. It's nice to see that someone put in the work to make the picture a little clearer, as practitioners tend to be too focused on their clients to do research like this, and this is a particularly unglamorous field of research.
Some CogSci & Neuro papers I found interesting in 2020:
Constantinescu, Alexandra O., Jill X. O’Reilly, and Timothy EJ Behrens. "Organizing conceptual knowledge in humans with a gridlike code." Science 352.6292 (2016): 1464-1468.
Kriegeskorte, Nikolaus, and Katherine R. Storrs. "Grid cells for conceptual spaces?." Neuron 92.2 (2016): 280-284.
Klukas, Mirko, Marcus Lewis, and Ila Fiete. "Efficient and flexible representation of higher-dimensional cognitive variables with grid cells." PLOS Computational Biology 16.4 (2020): e1007796.
Moser, May-Britt, David C. Rowland, and Edvard I. Moser. "Place cells, grid cells, and memory." Cold Spring Harbor perspectives in biology 7.2 (2015): a021808.
Quiroga, Rodrigo Quian. "Concept cells: the building blocks of declarative memory functions." Nature Reviews Neuroscience 13.8 (2012): 587-597.
Stachenfeld, Kimberly L., Matthew M. Botvinick, and Samuel J. Gershman. "The hippocampus as a predictive map." Nature neuroscience 20.11 (2017): 1643.
Buzsáki, György, and David Tingley. "Space and time: The hippocampus as a sequence generator." Trends in cognitive sciences 22.10 (2018): 853-869.
Umbach, Gray, et al. "Time cells in the human hippocampus and entorhinal cortex support episodic memory." bioRxiv (2020).
Eichenbaum, Howard. "On the integration of space, time, and memory." Neuron 95.5 (2017): 1007-1018.
Schiller, Daniela, et al. "Memory and space: towards an understanding of the cognitive map." Journal of Neuroscience 35.41 (2015): 13904-13911.
Rolls, Edmund T., and Alessandro Treves. "The neuronal encoding of information in the brain." Progress in neurobiology 95.3 (2011): 448-490.
Fischer, Lukas F., et al. "Representation of visual landmarks in retrosplenial cortex." Elife 9 (2020): e51458.
Hebart, Martin, et al. "Revealing the multidimensional mental representations of natural objects underlying human similarity judgments." (2020).
Ezzyat, Youssef, and Lila Davachi. "Similarity breeds proximity: pattern similarity within and across contexts is related to later mnemonic judgments of temporal proximity." Neuron 81.5 (2014): 1179-1189.
Seger, Carol A., and Earl K. Miller. "Category learning in the brain." Annual review of neuroscience 33 (2010): 203-219.
Neurolinguistics:
Marcus, Gary F. "Evolution, memory, and the nature of syntactic representation." Birdsong, speech, and language: Exploring the evolution of mind and brain 27 (2013).
Dehaene, Stanislas, et al. "The neural representation of sequences: from transition probabilities to algebraic patterns and linguistic trees." Neuron 88.1 (2015): 2-19.
Fujita, Koji. "On the parallel evolution of syntax and lexicon: A Merge-only view." Journal of Neurolinguistics 43 (2017): 178-192.
Nice to see someone found Bird song book interesting.
Here's a cool paper on bird song during lockdown.
Derryberry, Elizabeth P., et al. Singing in a silent spring: Birds respond to a half-century soundscape reversion during the COVID-19 shutdown (2020) (DOI: 10.1126/science.abd5777)
Thanks for putting this list together! I took some cog sci courses in college and have been meaning to dive more into the research around it and these papers seem like a good place to start. I expect to run into lots of jargon and concepts I don't understand. Would it be possible for me to reach out to you for questions when I'm unable to make sense of the content after having researched the unknown concepts online?
Pulvermüller, Friedemann. "Words in the brain's language." Behavioral and brain sciences 22.2 (1999): 253-279.
Pulvermüller, Friedemann. "Brain embodiment of syntax and grammar: Discrete combinatorial mechanisms spelt out in neuronal circuits." Brain and language 112.3 (2010): 167-179.
Lau, Ellen F., Colin Phillips, and David Poeppel. "A cortical network for semantics:(de) constructing the N400." Nature Reviews Neuroscience 9.12 (2008): 920-933.
Bellmund, Jacob LS, et al. "Navigating cognition: Spatial codes for human thinking." Science 362.6415 (2018).
Peer, Michael, et al. "Processing of different spatial scales in the human brain." ELife 8 (2019): e47492.
Kriegeskorte, Nikolaus, and Rogier A. Kievit. "Representational geometry: integrating cognition, computation, and the brain." Trends in cognitive sciences 17.8 (2013): 401-412.
Mok, Robert M., and Bradley C. Love. "A non-spatial account of place and grid cells based on clustering models of concept learning." Nature communications 10.1 (2019): 1-9.
Chrastil, Elizabeth R., and William H. Warren. "From cognitive maps to cognitive graphs." PloS one 9.11 (2014): e112544.
--> On graph navigation and 'rich club' networks:
Watts, Duncan J., and Steven H. Strogatz. "Collective dynamics of 'small-world' networks." Nature 393.6684 (1998): 440-442.
Kleinberg, Jon M. "Navigation in a small world." Nature 406.6798 (2000): 845-845.
Ball, Gareth, et al. "Rich-club organization of the newborn human brain." Proceedings of the National Academy of Sciences 111.20 (2014): 7456-7461.
Malkov, Yury A., and Alexander Ponomarenko. "Growing homophilic networks are natural navigable small worlds." PloS one 11.6 (2016): e0158162.
Givoni, Inmar, Clement Chung, and Brendan J. Frey. "Hierarchical affinity propagation." arXiv preprint arXiv:1202.3722 (2012).
--> On concept formation, memory & generalization
Bowman, Caitlin R., and Dagmar Zeithamova. "Abstract memory representations in the ventromedial prefrontal cortex and hippocampus support concept generalization." Journal of Neuroscience 38.10 (2018): 2605-2614.
Garvert, Mona M., Raymond J. Dolan, and Timothy EJ Behrens. "A map of abstract relational knowledge in the human hippocampal–entorhinal cortex." Elife 6 (2017): e17086.
Collin, Silvy HP, Branka Milivojevic, and Christian F. Doeller. "Hippocampal hierarchical networks for space, time, and memory." Current opinion in behavioral sciences 17 (2017): 71-76.
Kumaran, Dharshan, et al. "Tracking the emergence of conceptual knowledge during human decision making." Neuron 63.6 (2009): 889-901.
DeVito, Loren M., et al. "Prefrontal cortex: role in acquisition of overlapping associations and transitive inference." Learning & Memory 17.3 (2010): 161-167.
Gallistel, Charles Randy, and Louis D. Matzel. "The neuroscience of learning: beyond the Hebbian synapse." Annual review of psychology 64 (2013): 169-200.
Martin, A., and W. K. Simmons. "Structural Basis of semantic memory." Learning and Memory: A Comprehensive Reference. Elsevier, 2007. 113-130.
Zeithamova, Dagmar, Margaret L. Schlichting, and Alison R. Preston. "The hippocampus and inferential reasoning: building memories to navigate future decisions." Frontiers in human neuroscience 6 (2012): 70.
Tenenbaum, Joshua B., and Thomas L. Griffiths. "Generalization, similarity, and Bayesian inference." Behavioral and brain sciences 24.4 (2001): 629.
"Estimating the number of unseen species: A bird in the hand is worth log(n) in the bush" https://arxiv.org/abs/1511.07428 https://www.pnas.org/content/113/47/13283