There's actually a subfield called 'scientific workflow systems' (SWS). An example of such a system is Kepler [1]. What the author(s) ask for at the end of the article seems to me to be, basically, process provenance. This is a fairly well-studied problem.
+1 to this comment. It is a reasonably sized body of research, too. On top of this, scientific workflow scheduling maps onto the classic scheduling problem, which is known to be NP-hard - the fact that the authors treat automating the traversal as all that is necessary suggests they have an incomplete understanding of the problem at hand.
The category on top of a given graph just adds path equivalences. While that might not seem like a big addition, it's crucial for restricting the structure's internal state.
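A tiny, purely illustrative sketch of what that buys you (the graph, edge labels, and equivalence below are all invented): two distinct paths through a small workflow graph collapse into one equivalence class once a path equation is imposed, which is the sense in which the category restricts the structure's internal state.

```python
# Toy example: enumerate paths in a small graph, then quotient them by a
# declared path equivalence (all names here are invented for illustration).

# Edges of a toy workflow graph: node -> [(edge_label, target_node), ...]
edges = {
    "raw":   [("stir", "mixed"), ("heat", "warm")],
    "mixed": [("heat", "ready")],
    "warm":  [("stir", "ready")],
}

def paths(src, dst, prefix=()):
    """Enumerate all edge-label paths from src to dst."""
    if src == dst and prefix:
        yield prefix
    for label, nxt in edges.get(src, []):
        yield from paths(nxt, dst, prefix + (label,))

# Declared path equivalence: stir-then-heat is "the same process" as
# heat-then-stir (purely illustrative).
equiv = {("stir", "heat"): ("heat", "stir")}

def canonical(path):
    return equiv.get(path, path)

all_paths = set(paths("raw", "ready"))
classes = {canonical(p) for p in all_paths}
print(len(all_paths), "paths,", len(classes), "equivalence class(es)")
# 2 paths, 1 equivalence class(es)
```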
It seems the planning of "opportunistic" software projects faces related challenges? Though it's more like research-direction exploration/planning than research production-process management.
Gantt charts are OK when simple time and resource constraints dominate and there's a single clear path. But not all "project-design spaces" collapse that simply.
Even small opportunistic projects can have painfully large dependency graphs. "If/when browser bug X is ever fixed", "if I find a better algorithm", "if we get hardware X", "if/when these conditions are ever met, budget this amount of work along these vectors", "we could go this way or that", "here are some associated risks", "possible risk mitigations/explorations", "X is a candidate goal state", "with some path dependencies", "and can pay for this pattern of costs". "Y is an alternative candidate goal state", "this area of project space has a bunch of scattered payoffs", "sweeping this way is sparse on payoffs until the risky big one", ... and on, and on.
I've never seen tooling which wasn't ghastly painful for sketching out stuff like this, editing it, and keeping it up to date. I have hopes of building something in VR. But perhaps I missed something?
One problem with modeling workflows as DAGs is coming up with reasonable ways to express common control-flow elements. That said, for the parts of a process that really are dependency trees, they're the right fit.
What kinds of control flow elements do you have in mind?
I know one we've run into a few times is "repeat this step as many times as necessary", like the repeated stirring example in the article. There's some nuance here because there are actually two graphs backing a workflow: the "configuration" graph that describes the experimental design, and the "execution" graph that describes what actually happened. In this case, the configuration graph would have a single node with a self-loop for stirring, and the execution graph would have a variable number of nodes, one after the other, based on how many times the stirring needed to happen.
The configuration graph might have cycles or more abstract control-flow mechanisms, but I think the execution graph is always a DAG because it's describing what happened in the real world. Configuration graphs are useful at the design step, but the execution graph is probably more useful for coordinating the experiment and analyzing how it went, and we've found that in practice, scientists might need to make unexpected ad-hoc changes a week into a month-long experiment, so trying to capture everything up-front in the configuration graph isn't always reasonable in the first place.
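To make that concrete, here's a rough sketch (hypothetical structures, not our actual data model) of a configuration graph with a "repeat as needed" self-loop, and the linear execution trace it unrolls into, which stays a DAG:

```python
# Configuration graph: adjacency list with a self-loop on "stir",
# meaning "repeat this step as many times as necessary".
config_graph = {
    "prepare": ["stir"],
    "stir":    ["stir", "measure"],   # self-loop: repeat as needed
    "measure": [],
}

def unroll_execution(stir_count):
    """Build the execution trace for a run where stirring actually
    happened `stir_count` times; the loop becomes a chain of nodes."""
    trace = ["prepare"]
    trace += [f"stir#{i + 1}" for i in range(stir_count)]
    trace.append("measure")
    return trace

print(unroll_execution(3))
# ['prepare', 'stir#1', 'stir#2', 'stir#3', 'measure']
```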
One aspect around control flow that makes it a little easier for us at the moment is that it's not DAGs all the way down. Each graph node corresponds to a lab notebook entry of instructions performed by a scientist, so any complex details around operating instruments and collecting results are just represented as human-readable instructions in a reusable template. For higher-level handoffs between scientists and teams, though, we've found that the DAG model works out pretty well.
My experience mirrors this. For basic input/output, DAGs are a rock-solid model of reality. I make a distinction between 'what' and 'how' that seems to be basically identical to your config/exec split. Many of the issues I have encountered are indeed in the parts where you want to describe the precise execution without making a noun out of every action you take (which tends to inhibit clear communication).
I'm a bit confused about what the nodes are. You're saying they are the individual lab notebook entries, but then I wonder where the results of the experiments are in this graph?
I would have put the instructions on what was done on the edges of the graph, and the results or measurements on the vertices. But I'm not sure I'm correctly understanding your model.
Heh, there are a lot of ways to model it, and we've iterated on the details quite a bit and I imagine we'll iterate more in the future, but here's a quick explanation:
In our model, a node is a "run", which is an experimental procedure performed on a sample (or sometimes multiple samples). Each run has input samples, output samples, and structured data results. A notebook entry can have multiple runs associated with it, since a scientist will likely be processing many runs at once, and every run lives in exactly one notebook entry.
A common use case is to take some input sample, perform some transformation on it, and produce a new output sample that will be used downstream. Another common use case is a "screening" step where you take a collection of samples (one run each), run some analyses on those samples, and discard the ones that don't meet certain criteria. So sometimes the output is a new sample, sometimes it's the same as the input, and sometimes there's no output.
An edge indicates that the output of one run will be automatically fed into the input of the next run, and generally shows the flow of physical samples through the different stages.
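Here's a very rough Python sketch of that shape (the class and field names are just for illustration, not our real schema):

```python
# Illustrative-only data model: nodes are runs performed on samples,
# and an edge feeds one run's output samples into the next run's inputs.
from dataclasses import dataclass, field

@dataclass
class Sample:
    sample_id: str

@dataclass
class Run:
    name: str
    inputs: list = field(default_factory=list)      # input Samples
    outputs: list = field(default_factory=list)     # output Samples (may be empty)
    results: dict = field(default_factory=dict)     # structured data results
    downstream: list = field(default_factory=list)  # edges to later Runs

    def feed_into(self, other: "Run"):
        """Edge: this run's output samples become the next run's inputs."""
        self.downstream.append(other)
        other.inputs.extend(self.outputs)

# A transformation run that produces a new sample, feeding a screening run.
extract = Run("extract", inputs=[Sample("s1")], outputs=[Sample("s1-lysate")])
screen = Run("screen")
extract.feed_into(screen)
screen.results["passed_qc"] = True
```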
Experimental design definitely has some parallels to code structure (the control flow graph, not the AST), but I think most real-world experiments would be awkward to model as code in a typical programming language. Some examples of things that I don't think translate super-well: arbitrary-fan-out forks and joins, complex decisions on when a process is "done", ad-hoc changes in the middle of the experiment, and experimental condition metadata on each branch.
Currently, we have a visual UI for defining all of this, which is particularly nice for scientists without a programming background, but it's certainly possible that we'd have some sort of description language (maybe something like a Makefile) in the future. And, of course, you could always write custom code to generate the graph and define it using our API (which isn't the same as the graph being derived from your code's control flow).
Also, like I mentioned with configuration vs execution graphs, more interesting than the "code" is the "execution trace". One of the main goals here is to keep a team of scientists in the loop on how a complex experiment is going. When you have hundreds or thousands of samples undergoing different treatments and analyses in parallel, it's not at all obvious how to visually represent what's going on, especially if you need to account for errors and ad-hoc changes that happen while the experiment is going on. In a certain sense, the "code" to define the high-level ideal structure is the easy part.
Spot on! Had these same thoughts trying to write my own mini data-processing workflow. I think every scientist faces it at some point if they do any more complex experiments. Most, like my previous advisor, solved it largely through simple Excel sheets and discipline. Solving it with code becomes about as comprehensible as an ad-hoc build system.
Out of curiosity, does your UI allow any form of "simulation", particularly random fuzzing/Monte Carlo trials? Having the ability to simulate how a complex rule set plays out can help weed out errors before starting real experiments.
P.S. I admire the work y'all are doing at benchling! Bio-sciences are surprisingly anachronistic when it comes to performing research.
> Out of curiosity, does your UI allow any form of "simulation", particularly random fuzzing/Monte Carlo trials?
No, at the moment we don't really have enough structured knowledge of the underlying steps to be able to do predictions/simulations like that. We're taking existing scientific workflows and adding structure to them, but there are still plenty of details that are unstructured and performed manually by scientists, and there's a lot of variability across use cases.
Monte Carlo trials certainly seem interesting, though, and hopefully something we can explore more as the procedures get more structured. A similar idea that's a bit closer in reach is to do an analysis on past variations of the same experiment to predict how a newly-designed experiment will go. Possibly we could do a simulation based on historical data like that, although modeling it correctly certainly sounds tricky.
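For what it's worth, the kind of forward simulation I'd picture looks roughly like this (every rule and failure rate below is made up, just to show the shape of it):

```python
# Monte Carlo sketch: simulate a branching workflow many times under
# invented rules and see how often it ends in each terminal state.
import random

def simulate_once(rng):
    samples = 96
    # Rule: screening discards samples below some threshold.
    samples = sum(1 for _ in range(samples) if rng.random() > 0.2)
    # Rule: proceed only if enough samples survive screening.
    if samples < 24:
        return "aborted"
    # Rule: each surviving sample has a small chance of instrument failure.
    failed = sum(1 for _ in range(samples) if rng.random() < 0.01)
    return "needs_rerun" if failed else "completed"

rng = random.Random(0)
outcomes = [simulate_once(rng) for _ in range(10_000)]
for state in ("completed", "needs_rerun", "aborted"):
    print(state, outcomes.count(state) / len(outcomes))
```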
> We're taking existing scientific workflows and adding structure to them, but there are still plenty of details that are unstructured and performed manually by scientists, and there's a lot of variability across use cases.
That makes sense. It's difficult to start adding structure to the biological sciences. And to the field's credit, I don't think it's for lack of ability or effort, but just the true complexity of dealing with biological systems. In contrast, even quantum mechanics seems simple!
> Possibly we could do a simulation based on historical data like that, although modeling it correctly certainly sounds tricky.
True, but it seems like you will have a very valuable historical dataset. I've been pondering this type of problem for a few years since my (abortive) attempt at a PhD in materials science. For my project I made some decent progress using various Bayesian and bootstrapping models to work with uncertainty in my models and tissue samples. Bayesian approaches, especially combined with MCMC-type analysis, can yield a lot of fruit when dealing with biological systems, with minimal need to understand the underlying models. But as you mention, modeling the parameters correctly is challenging, and there's a lacuna in the research broaching the topic. But at the least, a system like yours could indicate things like the conditional probability of failure given certain sequence sets or combinations of experimental procedures (e.g. cell line A tends to do better when cultured in X media and tested with Y +/- Z hours). The article seems to indicate you'd have most of the data described in a format that'd be amenable to such data mining.
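As a toy illustration of the kind of data mining I mean (the table below is fabricated, purely to show the shape of the query):

```python
# Estimate conditional failure rates for combinations of experimental
# conditions from a historical run table (fabricated example data).
import pandas as pd

history = pd.DataFrame({
    "cell_line": ["A", "A", "A", "B", "B", "B", "A", "B"],
    "media":     ["X", "X", "Y", "X", "Y", "Y", "Y", "X"],
    "failed":    [0, 0, 1, 1, 0, 0, 1, 1],
})

# P(failure | cell_line, media) estimated as the empirical mean.
rates = history.groupby(["cell_line", "media"])["failed"].mean()
print(rates)

# A Bayesian version would put a Beta prior on each rate instead of using
# the raw empirical mean, which matters when per-condition counts are small.
```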
Just curious if you guys have experimented with anything like Apache Airflow, where the core component is a DAG structured as a simple code file with components for each node?
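For reference, this is roughly what a DAG definition looks like in Airflow 2.x (task bodies are placeholders, and the `schedule` parameter assumes a reasonably recent 2.x release):

```python
# Minimal Airflow-style DAG sketch: each node is an operator, and
# dependencies are declared with the >> operator.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def placeholder(step_name):
    def _run():
        print(f"performing {step_name}")
    return _run

with DAG(
    dag_id="example_experiment",
    start_date=datetime(2024, 1, 1),
    schedule=None,   # triggered manually rather than on a schedule
    catchup=False,
) as dag:
    prepare = PythonOperator(task_id="prepare", python_callable=placeholder("prepare"))
    screen = PythonOperator(task_id="screen", python_callable=placeholder("screen"))
    sequence = PythonOperator(task_id="sequence", python_callable=placeholder("sequence"))

    prepare >> screen >> sequence   # execution order / dependency edges
```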
[1]: https://en.wikipedia.org/wiki/Kepler_scientific_workflow_sys...