The Building Blocks of Interpretability

colah3 · on March 6, 2018

Hello! I'm one of the authors; we'd be happy to answer any questions.

Make sure to check out our library and the colab notebooks, which allow you to reproduce our results in your browser, on a free GPU, without any setup:

https://github.com/tensorflow/lucid#notebooks

I think that there's something very exciting about this kind of reproducibility. It means that there's continuous spectrum of engaging with the paper:

Reading <> Interactive Diagrams <> Colab Notebooks <> Projects based on Lucid

My colleague Ludwig calls it "enthusiastic reproducibility and falsifiability" because we're putting lots of effort into making it easy.

taliesinb · on March 6, 2018

Hello! I'm a fan of this latest publication, distill.pub, and your previous work.

I have three questions:

1. Flexible interfaces based on the kinds of grammars you describe in this post feel like their time have come, as an important part of scientific communication, but also as a tool to create, test, and update hypotheses themselves. Entire experiments could be written in the language of these grammars, where transformations and operations directly produce the artifacts that support scientific conclusions: things like means and variances, distributions, statistical tests, scatter plots, and the like. What are your thoughts on that? Is that a direction in which Lucid is interested in going?

2. I'm interested in how to automate the creation of such linked interfaces, whether they be for convnets, language models, seq2seq type models, etc. The idea would be to have the user write a query in a DSL that explicitly connects specific interior activations (on a given axis/axes) with gradients and visualization techniques to produce a linked interface. I haven't yet played around with Lucid -- is Lucid already such a DSL for tensorflow? Or does it rather collect various instantiations of the general grammar that have proved useful?

3. What are the best examples you know of linked interfaces and grammar-like approaches to visualization inspiring important discoveries? Within deep learning, it feels like the slam-dunk example is OpenAI's sentiment neuron, although that was a fairly simple form of visualization. I'd be curious to find more examples, both inside and outside the field of deep learning.

cs702 · on March 7, 2018

Hi Chris! -- I have four comments:

First, as always, I have to say THANK YOU, to you, and to the other authors, and to your various helpers, for putting this together, with no obvious goal other than the public good. :-)

Second, your idea of (a) taking all activation values in a trained DNN in response to a particular sample, (b) reshaping all these values into a single giant matrix, and (c) factorizing this giant matrix to identify "neuron groups" (i.e., some kind of low-rank approximation) that most closely explain the behavior of the DNN for each particular sample... is a brilliant idea. In hindsight, it seems like an obvious thing to do, but I don't think I've seen anyone else do it before. I suspect this kind of whole-DNN matrix decomposition will be widely applicable across architectures and modalities, not just for convnets in visual tasks.

Third, I think this barely scratches the surface of the kinds of UI-driven "interpretation tools" that ultimately will be needed to enable (non-technical) human beings be able to associate or attribute DNN behavior to "salient ideas," including those salient ideas for which human beings currently lack descriptive terminology. This is exciting stuff, and I can't wait to see what kinds of interpretation tools (and UIs) you and others come up with in the near future.

Finally, I can't help but wonder if the behavior of state-of-the-art AI systems has already exceeded, or is on the verge of exceeding, human capacity to interpret it. For example, what if the number of "salient ideas" current AI systems can discover vastly exceeds the number of distinct salient ideas a human brain can distinguish and work with? What is your view or opinionated guess on this?

colah3 · on March 7, 2018

> factorizing this giant matrix to identify "neuron groups" ... is a brilliant idea.

As with many ideas in this article, Alex deserves all the credit. He's been doing that trick internally at Google for years. (In my experience, if Alex is excited about an idea, >50% odds you'll realize it's a super important idea 2 years later. :P )

I think there's I saw an instance of someone doing PCA on conv net activations to make heat maps. We should try to dig that up and cite it.

> I suspect this kind of whole-DNN matrix decomposition will be widely applicable across architectures and modalities, not just for convnets in visual tasks.

Absolutely! (I think that most of the ideas in our article are pretty general. :) )

> This is exciting stuff, and I can't wait to see what kinds of interpretation tools (and UIs) you and others come up with in the near future.

Thanks! We're super excited as well!

> Finally, I can't help but wonder if the behavior of state-of-the-art AI systems has already exceeded, or is on the verge of exceeding, human capacity to interpret it... What is your view or opinionated guess on this?

Oh my. That's a super interesting and deep question. Wild wild speculation ahead. Please take everything I say with a big grain of salt.

My first comment is that a model can be hard to understand because it's too stupid, in addition to being hard to understand because it's exceeding us. I suspect that if one could have an "interpretability curve" of how easy models are to understand vs model performance -- it's kind of hard to make this sensical, because both variables are actually these really nuance high-dimensional things, but imagine you could -- it would go up as you approach human performance and actually peak after it, before eventually declining again. This intuition is driven by the thought that early superhuman performance for tasks humans are already good at is probably largely about having really crisp, much more statistically precise, versions of our abstractions. Those crisp versions of our abstractions are probably easier to understand than the confused ones.

But, of course, that one dimensional view is a gross simplification. I suspect that something that will be important in future thought about this is "alien abstractions" vs "refined abstractions."

By a "refined abstraction" I mean something like ears in GoogLeNet. You see, GoogLeNet has dozens of detectors for different kinds of ear -- a much richer vocabulary of ear types than I can articulate -- and knows a lot about how those should influence class probabilities. Although I can't understand the detailed nuance of each ear detector, I can get the general gist and verify that it has reasonable consequences. Conversely, by an "alien abstraction" I mean a feature that I don't have a corresponding idea for. These are much harder for us to deal with.

Both "refined" and "alien" abstractions could give superhuman performance. We're in a much better state when refined ones dominate. For visual tasks that humans are already good at, I expect refined abstractions to dominate for a while. In other domains, I have a lot less confidence.

I think there's a lot of nuance about where interpretability will easily scale vs have severe challenges is super subtle. I have a draft essay floating around on the topic. Hopefully I'll get it out there someday.

cs702 · on March 7, 2018

> As with many ideas in this article, Alex deserves all the credit. He's been doing that trick internally at Google for years. ... I think there's I saw an instance of someone doing PCA on conv net activations to make heat maps. We should try to dig that up and cite it.

Not surprised to hear this. It really does seem like an obvious thing to do... yet no one has taken the time to look carefully/methodically at it until now... probably because everyone is too busy with other, newer, flashier things.

> Oh my. That's a super interesting and deep question. Wild wild speculation ahead. Please take everything I say with a big grain of salt.

Thank you. Love it!

> My first comment is that a model can be hard to understand because it's too stupid, in addition to being hard to understand because it's exceeding us ... this intuition is driven by the thought that early superhuman performance for tasks humans are already good at is probably largely about having really crisp, much more statistically precise, versions of our abstractions. Those crisp versions of our abstractions are probably easier to understand than the confused ones.

Yes, that makes sense to me -- but only as long as we're talking about tasks at which humans are already good. I'm not so sure this is/will be the case for tasks at which humans underperform state-of-the-art AI -- such as, for example, learning to recognize subtle patterns in datacenter energy usage necessary for being able signifcantly to lower datacenter energy consumption, or learning to recognize new kinds of Go-game-board patterns that likely confer advantages to a Go player.

> ...I suspect that something that will be important in future thought about this is "alien abstractions" vs "refined abstractions." ... by an "alien abstraction" I mean a feature that I don't have a corresponding idea for. These are much harder for us to deal with. ... ...For visual tasks that humans are already good at, I expect refined abstractions to dominate for a while. In other domains, I have a lot less confidence.

Yes, that makes sense to me too.

Leaving aside the possibility that there might be cognitive tasks beyond the reach of human beings, I have an inkling that we're going to run into more and more of "alien abstractions" or "alien salient ideas" as AI is used for more and more tasks at which human beings do poorly. In particular, I suspect "alien abstractions" will become a serious issue in many narrow domains for which humankind has not invested the numerous man-hours necessary to learn to recognize (let alone name!) a sufficient large number of "refined abstractions."

As an analogy, I imagine the abstractions learned by AI systems in those domains will be as foreign to human beings as the 50+ words Inuit tribes have for different kinds of snow are to you and me -- and probably more so.[0]

> I think there's a lot of nuance about where interpretability will easily scale vs have severe challenges is super subtle. I have a draft essay floating around on the topic. Hopefully I'll get it out there someday.

I can see that, given the computational complexity involved. (I suspect all those new "randomized linear algebra" algorithms will prove useful here.)

Looking forward to reading the article if and when you get around to it. Thank you!

--

[0] https://www.washingtonpost.com/national/health-science/there...

ludwigschubert · on March 6, 2018

We need to go from it being special to open source code for a paper to making it almost seamless to interact with a presented idea, to try it in new contexts, to build upon it, and, yes, to falsify it.

When we introduce examples into an article we do so to make it easier to understand our message, but this opens us up to criticism of cherry picking our examples. We take a first step by allowing readers to choose from a set of examples, but real progress is open sourcing the code behind each diagram, and making it seamless for readers to run.

If you primarily care about finding truth, then you don't only need to make reproduction and falsification possible, you need to make it easy.

Nzen · on March 7, 2018

While I agree in principle, in practice this implies a maintenance burden. Who should the burden fall on? Some researchers leave to industry positions. Some leave for family obligations.

J Scoggins published an expert system wavefront scheduler [0][1] six years ago. Now, he's working on an expert system processor simulator [2]. Should requests on his earlier work take time from his current work? If he were to release a docker image, or whatever, of his setup while writing these tools is that enough?

Obviously, there's a spot to shoot for between tossing binaries over the wall and a month long sabbatical to write documentation and example code for something no one may ever use.

I have to ask, because I didn't pursue academia. Would you, as a grad student (per your github), reject or welcome a policy that insisted that the last week, or whatever, of the development window is to document the tools of the study?

[0] https://github.com/DrItanium/WavefrontScheduling

[1] https://cpp-primo.hosted.exlibrisgroup.com/primo-explore/ful...

[2] https://github.com/DrItanium/maya

nmca · on March 7, 2018

Hey - I thought that this was interesting work, but just wanted to jump in and say thanks so much for the effort in reproducibility. This should be the new standard, but putting in the effort to do it now is improving the world at relatively little personal benefit. So thanks!

brylie · on March 7, 2018

I couldn't read the article on a mobile screen. The text rendered at about 3px high. Please consider using a mobile-friendly CSS design.

jashkenas · on March 6, 2018

Hi Chris!

(Apologies if it's putting you too much on the spot, but I'm actually curious in hearing about this...)

My question for you is how you personally navigate the ethical issues that are raised by today's news about Google working with the U.S. military to help classify imagery from drone footage. (1) (2)

In your tweet about this paper, you write:

It qualitatively changes the questions we can ask. One basic example:

neuron 426 fired ~= useless

[floppy ear] fired = very interesting!

But one can just as easily imagine:

neuron 183 fired ~= useless

[young man fleeing] fired = very interesting!

Are there conversations within Google Brain about the appropriate applications of this research you can share with us? Do you feel like this class of military technology is going to be developed one way or the other, so there's no use in worrying about speeding it along?

1: https://gizmodo.com/google-is-helping-the-pentagon-build-ai-...

2: https://theintercept.com/2018/03/06/google-is-quietly-provid...

colah3 · on March 6, 2018

Hi Jash!

I focus my work on technical research directions I believe increase the probability of advances in AI being outstanding for all of humanity. I'm fortunate that Google provides me resources to do so, with broad autonomy to work on what I think is important.

My main focus is on building techniques that allow humans to understand how neural networks make decisions, because I think that will be a powerful tool in making sure increasingly automated systems are aligned with society's values -- for example, being fair -- and are making decisions for reasons we endorse. I also spend a lot of time thinking about accident risk and safety. I believe this work is robustly good for the world.

Of course, lots of concerns about AI can't be addressed by the kind of technical research I work on. I think technical research into interpretability, robustness, safety, fairness, etc... is necessary but not sufficient for the kind of world we want to see. There are also tough questions of public policy, international governance, and scientific ethics -- such as issues around the abuse of AI systems. On these kind of issues, I'd largely defer to people who think deeply about these topics, like Allan Dafoe (Oxford), Michael Page (OpenAI), and Tim Hwang (Ethics & Governance of AI Fund).

jashkenas · on March 6, 2018

Thanks for the answer – I really appreciate it.

YeGoblynQueenne · on March 7, 2018

If deep learning researchers had a strong theoretical understanding of their own field, a "grand unified theory of deep learning" (and possibly all learning) then they wouldn't need special tricks to do explanation .

Unfortunately, the deep learning field suffers from what John McCarthy used to call the "Look ma, no hands" approach to AI: that's when you get a computer to do something that hasn't beeen done before with a computer and publish a paper to announce it without any attempt to identify and study any intellectual mechanisms.

So the majority of deep learning papers are result papers: someone tweaks an existing architecture, or invents a new one, to do something new, or beat the state-of-the-art results. Theoretical papers are very few and far between and often come from outside the field (like the Renormalisation Group paper, or the papers on Information Bottleneck Theory).

I don't see how work like the one described in the article bucks the trend. Visualisation may be intuitive, but any two people can see completely different things in the same image, especially complex images of large scale activations. The result is an interpretation method that's up to, well, personal interpretation. What's more- this only works with vision, where activations can sort of map to images. It's no use for, say, text, sound, or other types of data (despite what the article says).

Articles like this tell me that deep learning researchers have basically given up on understanding how their own stuff works in the course of their careers and simply accept that their success must come from beating benchmarks and producing pretty graphs.

A pitty.

colah3 · on March 7, 2018

I'd like to offer a more optimistic counter view.

I think it's likely that deep learning has stumbled upon something deep and profound. And now we're at the point of struggling to make sense of it. Of course things are messy: we're knee deep in the business of trying to start to sort things out.

In the optimistic view, the ideas we're grappling with -- ideas like feature visualization, attribution, etc -- might be the seeds of deep abstractions like calculus or information theory. (Of course, these early versions are messy! For example, early calculus was deeply criticized by figures like Berkeley, and took more than a century to put on firm footing via the introduction of limits.) Powerful, novel abstractions may not look like what you expect at first.

I do think it's very reasonable of you to be skeptical, of course. Most attempts to craft new ways of thinking about hard problems don't pan out. But I think it's worth pushing really hard on them, because if they do they're very valuable. I feel like we have initial promising results -- give us a decade to see where they go! :)

But we could also be totally barking up the wrong tree. :)

> What's more- this only works with vision, where activations can sort of map to images. It's no use for, say, text, sound, or other types of data (despite what the article says).

That's a reasonable concern. We didn't give any demonstrations of our methods outside vision in the article. I can say that we have done very early-stage prototypes that suggestion similar interfaces work in other domains. Of course, instead of images you get symbols in what ever domain you're working with -- such as audio or text.

YeGoblynQueenne · on March 7, 2018

Thank you for taking the time to write a substantial reply!

You might be right about deep learning having stumbled upon something deep and profound. Or it may just be the case of "big machine performs well at task that is hard for humans". Like you say, we will have to wait and see.

It's just that we won't be seeing much, unless the field focuses real effort on the task of coming up with some kind of "calculus of (deep) learning". As things go right now, it might take more than ten years to see the progress you're hoping for.

On a personal note, I should say that I do quite like your idea, in principle. But that's because you're proposing a grammar of design spaces; I think we should use grammars everywhere :P

On the demonstration of your technique in other domains than vision- well, that would be really interesting to see. I watched a presentation by a gentleman called Willem Zuidema recently, whose work is on computational linguistics. His team had worked to interpret their deep learning models by visualising their hidden unit activations; he said that it was extremely painful and didn't scale well (he was talking at CoCoSym 2018, a workshop that featured much work on the prospects of combining deep learning with symbolic techniques, mainly for interpretation). If your method can work well in other domains it will definitely be useful to many people. It's still not the kind of theoretical result I'm hoping for, but it would be nice to see a principled way to extract symbolic representations from a continuous space - if that's what you're talking about.

Anyway, I'll keep an eye out :)

cosmic_ape · on March 7, 2018

Could you point to some reasons why you'd think that there is anything deep and profound about deep learning? As far as the evidence I know goes, its very shallow as a scientific concept. All there is, is the idea of using convolutions for image feature extraction, which predates "deep learning" by decades. And there is little evidence that the nets do anything more than memorization, which is hardly profound too.

cs702 · on March 7, 2018

> Articles like this tell me that deep learning researchers have basically given up on understanding how their own stuff works in the course of their careers and simply accept that their success must come from beating benchmarks and producing pretty graphs. ...a pity

This is an overgeneralization, and not a fair one, in my view.

The real issue you're bringing up, as I see it, is that we have yet to come up with a set of mental tools necessary for reasoning about deep learning in a rigorous way. In other words, I believe our position is analogous to that of, say, educated people in the early 17th century who could not reason rigorously about Zeno's "paradoxes of motion" because the ideas of calculus had not yet been proposed by Newton and Leibniz.

A considerable amount of human intelligence and ingenuity goes into the design and development of novel deep learning systems. I speak from experience. Every tweak and design decision in every model has a reason for being. These deep learning models are not good at what they do by accident!

YeGoblynQueenne · on March 7, 2018

You summarise my comment very well; I'm sorry it feels unfair to you but I hope you can take it as an example of the frustration of outsiders looking in and I think I make it clear why I feel that way. I'm also pretty sure you're right and that the performance of deep learning models is not down to pure chance, but that intuition does us no good while we can't say why they work well.

I understand that a fair bit of thought goes into deep learning models- but what kind of thought is that? I get the feeling that most deep learning papers simply describe the most successful of a series of knob-turning experiments. "We turned this knob here and performance went up by 0.05%!". I can't help but imagine all the time wasted turning the "wrong" knobs and trying the "wrong" things- i.e. tweaks to models that didn't pan out as hoped and that never get published.

Having a "calculus of deep learning" would save a lot of the time wasted fumbling around in the dark only to come up with a dead end. Experiments could be guided by principle, people would know what is promising and what "might be worth a try". New researchers or outsiders would have a chance to get a grip on the gigantic bibliography armed with analytical tools, rather than forced to painstakingly develop an intuition that can be hardly communicated to their peers.

The trouble is that there is nothing like that "deep learning calculus" yet and the whole field seems to me (as an outsider, of course) to be much more interested in finding what works, than understanding why it works.

cs702 · on March 7, 2018

Yes, I agree: the kind of "deep learning calculus" you're talking about does not exist today.

> ... I get the feeling that most deep learning papers simply describe the most successful of a series of knob-turning experiments. "We turned this knob here and performance went up by 0.05%!"

While guessing and tinkering is necessarily part of it, for the most part we're not talking about "random knob-turning" experiments. Design decisions are usually motivated by (a) mathematical intuition (e.g., if we have reason to believe the high-level abstractions to be learned have 'spatial meaning,' we might try using convolutions) and/or (b) numerical evidence (e.g., if gradients backpropagated during training are exploding or appear to be too high in a particular group of layers, we might try clipping gradients or using lower learning rates in that group of layers during training.)

> ...the whole field seems to me (as an outsider, of course) to be much more interested in finding what works, than understanding why it works.

That characterization is not fair, in my view.

There's a growing list of things we understand. For example: We understand why and how convolutions work. We understand why and how LSTM and GRU cells work. We understand why and how ReLUs backpropagate gradients so well. We know that dropout reduces to a scaled form of L2 regularization in linear models. We understand why SELUs are self-normalizing (in fact, they're constructed to be self-normalizing). We have a good intuition as to why initializing recurrent-layer weights with identity matrices works so well. Our understanding as to why residual layers work so well in really deep networks has improved considerably. Our understanding as to why SGD works so well in non-convex problems with many local minima has improved considerably. We have a good inkling as to why square-error and Wasserstein losses are less susceptible to 'mode collapse' for training generative adversarial networks.

These are just some examples I can rattle quickly, off the top of my head, in no particular order. This short list doesn't really do justice to how much our understanding of what works in deep learning and why has improved over, say, the past five years.

That said, we're still in the early days of AI, and there's a lot of work to be done to improve our understanding of it. Like Chris Olah, I believe (but can't prove, of course) that we've stumbled into something new and profound.[0][1]

--

[0] For example, there appear to be deep connections between the symmetries that define the physical laws of our universe and the ability of deep learning models to work so well with what I like to call "ludicrously high dimensional" data. The fact that certain simple convnets are equivalent to certain Renormalization Group transformations is evidence of the existence of these connections, I think (but can't prove).

[1] As another example, to the best of my knowledge there is no good, formal, precise way of measuring (let alone discussing!) the kind of "hierarchically decomposable structure that is highly robust to loss" that is prevalent in natural high dimensional data -- the kind of data for which deep learning works so well. Simply put, measures like Shannon entropy or Kolmogorov complexity don't tell us much about the "hierarchical complexity" of, say, objects in a photo. The fact that such a measure doesn't exist is evidence that we've stumbled into something new and profound, I think (but can't prove).

YeGoblynQueenne · on March 8, 2018

Of course there is some work in that direction- but it is not the majority of the field, not by a long shot. The advances you list are interesting and they are possibly the building blocks of a larger theory, but they remain few and far between- occasional, disconnected contributions on the side of the main work in deep learning. Without a concentrated effort to build on them, rather than chase records, we will not see a "calculus of deep learning" any time soon.

andbberger · on March 7, 2018

Awesome - glad that the folks at the big G are tackling this problem. IMO the field is in sore need of tools that help open the black box - I think it's going to be a big challenge to design such tools in a way that they are both flexible and powerful[1], but composeable building blocks certainly feel like the right start.

[1] This is a big challenge for DL tools in general - if what you're trying to do can't be expressed out of standard tensorflow ops, you're going to have a bad time.

colah3 · on March 7, 2018

Thanks! :)

groceryheist · on March 7, 2018

The ideas from this article are really cool, and the design is beautiful. I see these techniques as providing the ability to partially interpret models. While clearly useful to practitioners seeking an intuition for what their models learn, it appears we are still very from the ability to thoroughly audit deep learning computer vision models.

I wonder if in the long run, making models that are both effective and interpretable can be done by first building a black box model, and then interpreting as much as possible it using clever ideas like those from the article. The interpretations of the black box model can inform the design of a relatively simple bespoke model. The bespoke model may never outperform the black box at prediction tasks, but in many applications the ability to perform audits and estimate of uncertainty should be worth it.

sgentle · on March 7, 2018

Wow. You ever get your mind blown three different ways at once? I don't know if I'm more impressed with the depth of ideas in this paper, the exquisite clarity of its presentation, or the glimpse this journal gives into the science of the future: open, participatory, technologically empowered, and anchored in the enrichment of human understanding. This is great and important work.

I was looking at the previous paper on feature visualisation (https://distill.pub/2017/feature-visualization/) and I couldn't help but notice the parallels between feature examples in neural networks and test cases in traditional programming.

Examples drawn from real datasets appear in both, as well as generating examples using optimisation processes (search-based software testing[0]), and optimising from real data (test data augmentation[1]). I even found something similar to the diversity-maximisation approach[2]. There are also some related ideas in the functional programming world that combine optimisation with constraints on the input domain (targeted property-based testing[3]) and do a similar kind of human-scale input reduction (counterexample reduction[4]).

More generally, maybe it makes sense to think of testing and interpretation as complementary ideas. Testing says "I think I understand this function's behaviour, but I want to examine its output and compare it against what I expect." Interpretation says "I think I understand this function's output, but I want to examine its behaviour and compare it against what I expect." Test cases are inputs that generate interesting outputs, and feature examples are inputs that generate interesting behaviour. In either case, the goal is to minimise the inputs and maximise interestingness, which it seems is best understood in terms of the relationship between behaviour and output rather than either one alone.

I'm curious if anyone's looked into this. Maybe there are some neat ways to apply techniques from one to the other, or even combine the two?

[0] https://pdfs.semanticscholar.org/67a9/ca5a33e3ab4c2300cdcfaa...

[1] http://crest.cs.ucl.ac.uk/fileadmin/crest/sebasepaper/YooH08...

[2] https://pdfs.semanticscholar.org/a261/87634f842919ef53d0da4f...

[3] http://proper.softlab.ntua.gr/papers/issta2017.pdf

[4] https://www.cs.indiana.edu/~lepike/pubs/smartcheck.pdf