Why scientists need to be better at data visualization

stared · on Nov 15, 2019

When I was a Ph.D. student I was surprised that in academia (my field was: quantum information theory) the majority of researchers put zero effort, and thought, in presenting data in a clean and clear way. (As a reaction I devoted a section in my thesis to say what is data vis.) One my day "but it is for other specialists". Nope, it is not the problem. Of course, every data vis should be tailored for a specific audience. But there is typically zero though put their either. See section 3.1.2 from https://arxiv.org/abs/1412.6796.

From a more recent one ("Simple diagrams of convoluted neural networks", https://medium.com/inbrowserai/simple-diagrams-of-convoluted...):

"[In academic] research, visualization is a mere afterthought (with a few notable exceptions, including the Distill journal, https://distill.pub/). One may argue that developing new algorithms and tuning hyperparameters are Real Science/Engineering™, while the visual presentation is the domain of art and has no value. I couldn’t disagree more! Sure, for computers running a program it does not matter if your code is without indentations and has obscurely named variables. But for people — it does. Academic papers are not a means of discovery — they are a means of communication."

amitport · on Nov 15, 2019

In my experience, it's just about cost-effectiveness. Proper visualization is hard, researchers have a lot of ground to cover and theory and proper experiments are just more important (not only for journal reviewers). Sure, if have a big research group with man hours to spare someone should focus on that, but it's rare to have that capacity.

stared · on Nov 16, 2019

Sure, creating a really good custom data visualization takes time and effort. Using default color themes in a non-ugly library - only minimal care.

jacobolus · on Nov 16, 2019

Friendly note: That arXiv paper would benefit from better color choices in its diagrams. HSL/HSV models should generally be avoided in all contexts.

Beyond that, the paper colors reference links #0000ff (neon green). This makes them illegible and unpleasantly distracting. Try a darker (and less intense) color to improve legibility. The bright red links are also a bit distracting, but that’s more a matter of personal preference.

There are some good resources in https://courses.cs.washington.edu/courses/cse512/19sp/ which I just ran across recently. e.g. http://www.personal.psu.edu/cab38/ColorSch/ASApaper.html

jofer · on Nov 14, 2019

A key aspect that this misses is the target audience. The general public is not the target of most scientific visualizations. Instead, it's other scientists in the field. There are invariably lots of conventions and general historical trends that it's important to follow to communicate clearly.

For example, in seismology high sonic velocities are always shown in blue/cool colors and low sonic velocities are always shown in red/warm colors. This is non-ideal for colorblind users and confusing for other audiences (red != high value), but it's a near-universal convention. We can use a more perceptually uniform red to blue colormap, but keeping the warm/cool convention is very important.

If your audience is used to seeing a certain type data in a particular way, you'll confuse people if you don't follow that convention. Clear labels and legends are great, but people follow convention first, labels second.

When the goal is clear communication, following established convention is much more important than making a "better" visualization. That's not to say that these guidelines aren't very important, just that it's vital to keep your audience's expectations in mind.

> "Scientists also tend to follow convention when it comes to how they display data, which perpetuates bad practices."

In my experience, scientists follow convention because it enables clear, consistent communication with other scientists in the field. Violating convention is okay, but should be done with utmost care and the knowledge that you'll need to spend more time explaining things.

mnky9800n · on Nov 14, 2019

the article even talks about cultural expectations making visualizations harder to read if the cultural expectations are violated. Except that the article completely ignores your point. Science has a culture and violating that culture to make a "better" visualization must be done with care.

b0rsuk · on Nov 15, 2019

If scientists can't be bothered to read the legend first, who can?

rflrob · on Nov 14, 2019

While I absolutely agree that scientists need more formal training in data visualization, the claim that "few scientists take the same amount of care with visuals as they do with generating data or writing about it" doesn't ring true to me about the scientists in my field (genetics/genomics). There is a widespread recognition that effective figures are the strongest way to communicate your message, the question is what is the the best way to create those effective figures. Part of the problem is that as datasets get bigger, it's rarely sustainable to put a lot of care into each and every version of a plot, but automating creation of figures is really hard.

If I were going to start designing a course in creating scientific figures, I think I'd have a roughly even split between the psychophysics of visual perception (e.g. distinguishing between similar quantities of lengths/angles/colors/etc; designing for color-blind readers; ) and hands-on work in a real programming environment turning data to figures.

Thriptic · on Nov 14, 2019

I agree with everything you said and can confirm that in our (translational vascular bio) lab the bulk of effort spent on paper drafting was concerned with creation of high quality figures. Many scientists (myself included), will read a paper abstract and then head straight for the figures as they usually contain the highest density of data for the reader.

jfim · on Nov 14, 2019

> Part of the problem is that as datasets get bigger, it's rarely sustainable to put a lot of care into each and every version of a plot, but automating creation of figures is really hard.

That's actually a good reason to learn R and the ggplot2 package. Whenever I write a paper, what I do is that I make a quick shell script that invokes Rscript, with a simple R program that takes a CSV file and outputs a PDF file of the plot, which can be automatically loaded in LaTeX.

Whenever the data changes, it's just a matter of updating the CSV file and running the script that rebuilds the figures and the LaTeX document. As an added bonus, it makes keeping the data with the paper easy, since they're part of the same source control repository.

samch93 · on Nov 15, 2019

I can totally recommend knitr by the amazing yihui xie (https://yihui.org/knitr/ ) for this use case. It allows you to write R code chunks directly in your latex code and the output of the code (tables, plots) is then directly inserted in your pdf after compiling. Together with git and docker this gives a fully reproducible workflow!

airstrike · on Nov 14, 2019

Is there a reason you don't use RMarkdown in RStudio? It's built precisely for this use case

jfim · on Nov 15, 2019

Mostly because I wrote those scripts many years ago and I've been reusing them since. I'll look into RMarkdown for the next paper, thanks!

puttermesser · on Nov 14, 2019

Relatedly, here’s some cool visualization work that comes from genetics data viz.

http://scalable-insets.lekschas.de/

noobermin · on Nov 15, 2019

I'd say there is a range, as there always is. I've seen fantastic and clear visuals of FDTD simulations of laser interactions in my field in 3d, then, I've seen jet used to pcolor data that ranges from positive to negative. I would say though that a good enough fraction (not sure if it's greater than 50% but I wouldn't be surprised if it was) do care about making good figures.

psalminen · on Nov 14, 2019

The course you are describing is one which I took. It was called "Scientific Visualization", and was a CS course mainly taken by science majors. The meta-information describing color schemes and scales was by far the most interesting part of it.

hinkley · on Nov 14, 2019

Everyone needs to be better at data visualization. If for no other reason than to be an informed consumer. If Mark Twain had lived through informatics I'm sure it would be four ways to lie: "Lies, Damned Lies, Statistics, and Charts"

There are a lot of bullshit techniques used in data viz that are tantamount to lying and people are often sincerely shocked when you call them on it.

"I didn't do that on purpose," as if they didn't learn when they're 4 that it doesn't matter if you meant it if you did it. You still have to apologize and try to make amends.

Things people do from ignorance or malice:

Remove the origin on the graph and the relative height of the lines is skewed. 3d pie charts are 'larger' on the bottom half. Circle charts conflate diameter with area (humans are bad at judging area). On a log scale plot, a fat enough line can make anything look like a trend, because the end of the line is literally orders of magnitude wider than at the origin.

Friends don't let friends use any of these techniques, and the jaded instantly distrust anyone who is using them.

The used bookstore on the closest university campus has an entire shelf full of old editions of Edward Tufte books. I don't know if you'll be so lucky, but it's well worth a shot.

stefco_ · on Nov 14, 2019

I agree that these techniques are often used in popular media in deceptive ways, but a couple of those plotting techniques have valid use cases.

Specifically, if you're really concerned about a delta, removing the origin is good for visualization; it lets you just see the difference between two trends (in particle physics, you can show an energy excess this way). Likewise, for exponential phenomena, log plots are the correct choice, since uncertainty will be magnified for larger values. In both cases, of course, you need to include error bars, but that is always true. But I don't think you can "instantly distrust" someone who is using these techniques when they are valid.

hinkley · on Nov 14, 2019

Most graphs with deltas are comparing multiple deltas, not a single delta, which results in misrepresentation.

By screwing with the origin you can make the alternative that saw a 6% reduction look much more compelling than the one that achieved a 4% reduction, when in fact it's probably only slightly more compelling.

stefco_ · on Nov 15, 2019

Yes, I agree that these are often abused tactics. Just saying they have their uses. The fundamental problem is graph literacy.

knzhou · on Nov 15, 2019

You should be specific about who you’re calling out. I’m sure this is good advice for journalists making infographics for the public, but it isn’t for scientists talking to other scientists.

I design my plots for readability and in the process regularly break every rule you listed here, because I trust my audience (other scientists) to understand how to read axes. Yes, it might be confusing to somebody who doesn’t read graphs with axes that often, but if I optimized my papers for their convenience, they would all have to be 100 pages long.

mattkrause · on Nov 15, 2019

Dogmatically insisting on origins-at-zero is silly: it depends too much on the data, the message, and the audience.

For example, my lab studies the electrical activity of brain cells. A neuron normally sits at -70 mV relative to the extracellular space. When it fires, it can briefly cross 0 mV (though it probably won’t reach +70 mV) and sometimes you might want to show that complete spike. Other times, subthreshold changes in the cells’ activity are more important to a hypothesis, and then you’ll zoom in on a smaller range (maybe -90 to -55 mV). Neither of these situations would benefit from a plot centered at zero; in fact, it would look decidedly odd.

(Your other comment about log-log plots also seems odd to me, because the width of the line is almost never meaningful, unless there’s something like a confidence/credible interval band, which case that’s the whole point.

hmwhy · on Nov 15, 2019

I don't think the comment on log-log plots is limited to just log-log plots, and the point is that there are times when the width of the line of best fit, together with the size of data points, misleads the reader to think that it's more linear than it actually is.

Problematic presentation of a similar type is particularly pronounced in, for example, finite-size scaled curves, where the perceived goodness of fit is heavily influenced by the line widths of the curves, the size of the data points, and the resolution of the plot.

These misleading data representations are something that can be found even in recent publications in prestigious journals (I don't think it's always intentional and therefore won't include any references); also, sometimes it's difficult to notice there are issues unless you are very familiar with the methods, or have tried to reproduce the results (and have wasted months or even years of your life by then).

hinkley · on Nov 15, 2019

Clearly you have to 'zero out' a graph for whatever the default state is. If I claimed otherwise, that was in error.

Yes, if the default value is -500, then setting your origin at 0 is going to cause confusion.

I think the problem in part here is that in academia you can make your publishing quota for a period of time on a 2% improvement to some process. In which case you're amplifying the outcomes as a form of self-promotion.

Rarely do software developers, city planners, or who knows what other professions are going to get any accolades or shifts in public opinion off 2% (unless it's taxes). In both cases you're trumping up the numbers. In one it's expected, and I think we can agree that things are perhaps not as they should be in the paper publishing circles without having to open that can of worms to define exactly why or in what ways.

dredmorbius · on Nov 14, 2019

Sorry, can you explain / show an example of the log plot issue?

leeoniya · on Nov 15, 2019

agreed. in working on uPlot, i've explicitly tried to stay away from misleading, hard to interpret, but pretty shit like line smoothing, stacked areas, etc.

https://github.com/leeoniya/uPlot#non-features

nycticorax · on Nov 14, 2019

This is like a non-programmer telling programmers they need to comment their code better. In a sense, it's probably true most of the time. But it's also sort of hard to take very seriously, because the non-programmer doesn't have any real understanding of the day-to-day challenges and trade-offs faced by people who write code for a living.

SubiculumCode · on Nov 14, 2019

Not to mention that most of us scientists spend quite a bit of time on our figures.

setr · on Nov 14, 2019

One major difference is that code comments aren't at all meant for public consumption (even in open sourced code, comments have a very constrained expected audience).

If anything, public documentation is probably a better analogy -- expected audience of developers, but of potentially wildly different experience levels.

And in that case, it's generally fair for say a novice to complain about the lack of examples, clarity,etc.

And these charts are the same -- this is the primary entry point for a novice in the subject, and a helpful tool for experts. A poor chart would be harmful for all expected audiences.

nycticorax · on Nov 17, 2019

The fact the code comments aren't meant for public consumption, and figures are, is completely irrelevant to the analogy. The point is that lots of things we do could be done better, but having an outsider focus on one of them, and tell you you could be doing it better, is not super-helpful. It's like being told by three people that you need to put a cover sheet on your TPS reports.

That said, and having looked at the article more carefully (ahem), most of the concrete suggestions they offer are sound, and are things that I think are pretty well-known, at least to the practicing scientists I talk to. "Pie charts are bad", and "Beware of false color images, especially ones that use the 'jet' colormap." are both (parapharases of) statements that I hear from working scientists.

But then there's this: "And yet few scientists take the same amount of care with visuals as they do with generating data or writing about it. The graphs and diagrams that accompany most scientific publications tend to be the last things researchers do, says data visualization scientist Seán O’Donoghue. “Visualization is seen as really just kind of an icing on the cake.”"

This bears zero resemblance to my experience. Most scientists of my acquaintance make the figures first, and then write the text. And certainly I'd say that more care is taken with the figures than the text.

I guess I just feel like I see a certain amount of fetishization of beautiful scientific illustrations (Tufte, etc), out of all proportion to its actual importance to the scientific endeavor. The readers of most scientific articles are probably not going to be flummoxed by a pie chart. And of course, time spent focusing on this kind of stuff is time that might very well be better spent doing experiments.

knzhou · on Nov 15, 2019

Scientific figures aren’t designed for public consumption either; the intended audience is the set of other scientists subscribed to that journal.

noobermin · on Nov 15, 2019

I sort of agreed until I googled the author's name. They were a "Knight Science Journalism Fellow at MIT" which I'll be honest, not sure I'm super aware of what that is, but I imagine they are somewhat knowledgeable.

solveit · on Nov 15, 2019

The author has a Master's degree in geology, but she's spent most of her career writing for laypeople and approximately none of her career writing for subject matter experts.

malshe · on Nov 14, 2019

Alberto Cairo recently released his "How Charts Lie". It's a delightful read.

https://www.amazon.com/How-Charts-Lie-Getting-Information/dp...

archgoon · on Nov 14, 2019

I remember a decade or so ago, the only way for me to figure out certain material values (this was material science / electrical engineering / chemistry) was to open up MSPaint and manually draw lines on chart data to find intercepts. Has the field gotten better about releasing raw data for papers?

Granted, this was sometimes because the only team that had measured a particular substance did it back in the 70s.

btrettel · on Nov 14, 2019

There are many programs to automate that, e.g., I use g3data. Unfortunately raw data is still rarely released today. One tip if the raw data wasn't released as a computer file: Look for a dissertation associated with the journal article you want data from. Dissertations often have tabulated data in an appendix.

mkl · on Nov 14, 2019

If you have the plot as a PDF or PostScript file you can sometimes write a script to extract the data directly from the coordinates of the plot elements (for PDF, uncompress the file with PDFTK, and then the commands are ~readable text). Sometimes the values extracted from plots contain more precision than values reported in tables.

It can be quite fiddly, though, so it might not be worth the effort unless you really need it, or you need to do it a lot with many similar figures.

tgb · on Nov 15, 2019

It varies by field. Some have standard repositories for data to go in - all RNA seq experiments get their raw data uploaded to GEO or similar, for example. But they usually don't specify their pipeline sufficiently precisely to be able to recreate the exact analysis done.

noobermin · on Nov 15, 2019

Not always, but for more reason papers, you could always shoot someone an email and ask for data.

_mghw · on Nov 14, 2019

I do simulations for a living aka numerical modelling. The results that I obtain need to be explained using plots. How behaviour changes with time, how component sizing will affect xyz parameter etc.

What I would dearly love is some way to animate the systems I simulate. A way to show HOW MUCH flow is going through a pipe, or how much heat transfer is happening in a heat exchanger. Something that's easier to grasp than dull plots.

Sadly, a) I do not even know what to google to find solutions for this, and b) all my primitive searches seem to lead to Blender, which has a large learning curve and way too much time investment requirements.

astrophysician · on Nov 14, 2019

https://yt-project.org/ -- yt for python was used by a few of my old colleagues for visualizing astrophysical simulations. There is yet another program that I've since forgotten but will try to come back here and comment if I remember.

If you can generate a single static figure, you can manually generate animations by generating each frame separately using your favorite plotting software. You can then stitch them together using many different tools (I use matplotlib) to generate an animation (video, gif, etc.).

What language do you like to use? What's your background -- are you a Matlab person?

mturk · on Nov 15, 2019

I work on yt, and I just wanted to comment to say that whenever I stumble on a mention of it somewhere, I genuinely feel warm inside. Thank you for thinking of us!

One of the things we've tried to do -- especially by leveraging the technology and innovations in the tools we build on (especially matplotlib!) -- is to make it easier to make visualizations and plots that are "by-default" communicative and information-rich.

And, of course we have lots of room for improvement, for picking up and encouraging "best practices" at the library level, but it has been really enjoyable and illuminating to have the opportunity to explore that with the community.

maxnoe · on Nov 15, 2019

Thanks for all the work on yt.

Did you consider replacing the rainbow color map images on your front page?

mturk · on Nov 15, 2019

Yup, and we don't use a rainbow colormap by default anymore! We do provide a few custom ones that were developed with viscm, which I believe was used by in development of viridis, magma, etc. I think the specific examples are all from the published literature, and we accepted them as submitted by users -- but a refresh would be good!

I opened an issue: https://github.com/yt-project/website/issues/70

contact_fusion · on Nov 15, 2019

My suspicion is that you are thinking of the tool VisIt ( https://wci.llnl.gov/simulation/computer-codes/visit/ ). VisIt vs yt has some tradeoffs, but by and large it is very useful for rapidly visualizing numerical simulations and supports a large variety of them. I use it daily.

astrophysician · on Nov 19, 2019

Thank you!!! Yes! VisIt is what I was trying to remember. I don’t know VisIt, only that it exists and numerical people (at least in Astro) use it.

_mghw · on Nov 14, 2019

I am a Matlab person, among other tools as well. I do something of the sort that you suggested in your second paragraph to plot time-varying diagrams by generating successive frames and then making a gif.

Matlab can also do some very nice colormap stuff. Just wish there was a method with less friction!

mirimir · on Nov 14, 2019

If you use a video format, viewers can scan in a controlled way.

Edit: OK, there's also gifview.

noobermin · on Nov 15, 2019

Someone mentioned yt, I'll go ahead and mention the imo clunky but standard in a lot of plasma/fluid physics, which is visit[0]. Not really a fan personally but a lot of people I know like it.

Tbh, for 2D data, I've written scripts that sort of just make a bunch of pngs from matplotlib and string them together using ffmpeg into a movie. For 3D data, I used to use mayavi and ffmpeg to make movies, but that too is pretty clunky. Mayavi feels closest to matplotlib which is why I like it, yt feels like there is way too much boilerplate to just get a plot (for example, I have to specify units for my data before a plot a flat array, I mean wtf).

[0] https://wci.llnl.gov/simulation/computer-codes/visit/

bigger_cheese · on Nov 15, 2019

Some of my coworkers have used the open source VTK library for this sort of stuff (Heat flow in a reactor in this case).

From what I saw it was C++ code so may not be the easiest thing to write.

noobermin · on Nov 15, 2019

VTK seems the standard, but it seems like you have to literally write a program to viz stuff. Is this true?

Like everyone else I'm super busy but I've been planning on just learning it even though the times I've looked, it's a true blue 90's C++ inheritance obsessed nightmare.

xorand · on Nov 14, 2019

I use javascript, especially with d3.js

noobermin · on Nov 15, 2019

I love javascript, honestly. Unfortunately, I'm not sure it works for HPC scale data without serious sampling to reduce the data you read in. The HPC scale data also means I need the ability to render things without X because downloading either the simulation data or X forwarding is just too slow that it makes it a grind.

airstrike · on Nov 15, 2019

Agreed. d3.js really is the answer here

martopix · on Nov 15, 2019

This is very true, don't get me wrong. But we have a problem, which is that scientists already need to be good (and efficient) at:

- Doing research - Supervising students - Teaching - Giving talks in a clear way - Writing papers in a clear way - Writing grants in an enticing way - Performing administrative tasks

This unfortunately is one of the causes why, in certain fields, industry research is much better than academic research. Google is draining brains from compsci departments all over the world, because a scientist at google is mostly a scientist. And there is much, much more teamwork.

lancebeet · on Nov 14, 2019

The "challenges" included in that article are a bit confusing to me, in particular the pie chart. It asks for the third largest segment and claims the answer is B. To me, it seems like B is the second largest and H is the third largest. I did some thresholding of the image and it also shows the order being C, B, H, D. Maybe I'm missing something?

knowablemag · on Nov 14, 2019

We've updated the story and added an editor's note at the end with the correction. Thank you again for spotting! -Katie, Knowable Magazine

knowablemag · on Nov 14, 2019

Hi, this was flagged to the editors, and a correction is coming. Thank you for spotting this problem! -Katie, Knowable Magazine

datashow · on Nov 14, 2019

Most charting packages have a terrible default: set y axis original point around y min, which is rarely appropriate in data visualization. In my opinion, the default should always be zero.

Another one is bars in bar/column charts are horribly wide.

geoalchimista · on Nov 14, 2019

> In my opinion, the default should always be zero.

This won't work well for a lot of cases. Imagine plotting the atmospheric CO2 concentration time series in the past 200 years. Setting the original point at zero ppm would not make sense because it can never happen. I'd say setting y axis original point at y min is a good trade-off since the plotting package is agnostic about the underlying nature of the plot.

danso · on Nov 14, 2019

I have mixed feelings. I agree that "y-axis should start at zero" should not be been as a sacred guideline by any means. But I reflexively feel that in the context of a data viz library, the default should be zero; thus requiring the user to explicitly set a minimum when needed, and to conscientiously do so.

But I do realize that practically speaking, it is much easier to set a y-axis to 0 (i.e. a constant value), rather than having to calculate the y-min or near-y-min for every chart.

mattkrause · on Nov 15, 2019

How would that work with negative data?

datashow · on Nov 15, 2019

What do you mean? Zero original point does not work with negative data?

mattkrause · on Nov 15, 2019

The OP's suggestion was that the y-axis should start at zero by default, not min(y). However, this seems like it would lead to empty plots when the data is all or mostly negative.

datashow · on Nov 16, 2019

Not really. Plot can be shown above and below the x-axis.

datashow · on Nov 14, 2019

I agree you may give a valid exception (I don't know the field), it may represent a large field. But I still think broadly speaking it is uncommon, but I don't have data to back it up.

nemetroid · on Nov 14, 2019

> Setting the original point at zero ppm would not make sense because it can never happen.

Why does that matter? I disagree. A graph item representing the amount of stuff in some context should double in size (exactly) if the amount of stuff doubles.

djrobstep · on Nov 14, 2019

This a common claim, but really non-zero y-mins are absolutely fine.

Do you think that every weather chart is wrong because it doesn't begin from absolute zero?

Usually when charts have misleading y-ranges, the real problem is lack of x-axis context.

For instance, a chart with unemployment over the last three years can be made clearer by showing unemployment over the last 50 years.

Much better to give a wider picture than pointlessly add whitespace.

glofish · on Nov 15, 2019

zero y-mins are usually recommended to show both the absolute values and relative variation at the same time.

Based on my own experience, the number of plots that are misleading because they do not start at zero is far greater than that of those that are insufficiently informative because they do start at zero.

That alone is not a reason for "banning" non-zero ymins, but it is definitely a good reason to more critically evaluate their use.

omginternets · on Nov 14, 2019

Also, the jet colormap.

frumiousirc · on Nov 15, 2019

A web site talking about information visualization and includes an obscuring top bar and throwing a huge pop up after loading. Yeah, that's a tab closing.

mongol · on Nov 14, 2019

Not only scientists. Politicians too. That would hopefully change the discourse to be more material and fact based and less retorical.

taurath · on Nov 15, 2019

Maybe if they used something prettier than R it would help?

lallysingh · on Nov 15, 2019

ggplot2 is pretty damn nice

pintxo · on Nov 14, 2019

Didn't even start reading, with the prominent pie-chart directly above the article. Not sure the author has any reasonable knowledge about visualization at all.

maxnoe · on Nov 15, 2019

Disadvantages and advantages of pie charts are discussed sensibly further down in the article