I wrote a workflow processing system (http://github.com/madhadron/bein) that's still running around the bioinformatics community in southern Switzerland, and came to the conclusion that something like make isn't actually what you want. Unfortunately, what you want varies with the task at hand. The relevant parameters are:
- The complexity of your analysis.
- How fixed your pipeline is over time.
- The size of a data set.
- How many data sets you are running the analysis on.
- How long the analysis takes to run.
If you are only doing one or two tasks, then you barely need a management tool, though if your data is huge, you probably want memoization of those steps. If your pipeline changes continuously, as it does for a scientist mucking around with new data, then you need executions of code to be objects in their own right, just like code.
Make-like systems are ideal when:
- Your analysis consists of tens of steps.
- You have only a couple of data sets that you're running a given analysis on.
- The analysis takes minutes to hours, so you need memoization.
Another Swiss project, openBIS, is ideal for big analyses that are very fixed, but will be run on large numbers of data sets. It's very regimented and provides lots of tools for curating data inputs and outputs. The system I wrote was meant for day-to-day analysis where the analysis would change with every run, was only being run on a few data sets, and took minutes to hours to run. Having written it and had a few years to think about it, there are things I would do very differently today (notably, making executions much more first class than they are, starting with an omniscient debugger integrated with memoization, which is effectively an execution browser).
So bravo for this project for making a tool that fits their needs beautifully. More people need to do this. Tools to handle the logistics of data analysis are not one size fits all, and the habits we have inherited are often not what we really want.
I agree with your sentiments about the nature of pipelines vs. build systems à la make. Many, many people start down the path of putting the classic DAG dependency analysis at the foundation of their needs when in fact this isn't so much of a problem in real situations, and is even somewhat counterproductive, because it forces you to declare a lot of things in a static way that actually aren't static at all. I've found tools like this completely break down when your data starts determining your workflow (e.g.: if the file is bigger than X, I will break it into n parts and run them in parallel; otherwise I will continue on and do it entirely in memory using a different command).
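For illustration, the kind of data-driven branching I mean (the commands here are made up):

# made-up commands, just to illustrate branching on the data itself
SIZE=$(wc -c < input.dat)
if [ "$SIZE" -gt 10000000000 ]; then
    # too big: split it into parts (GNU split) and process them in parallel
    split -n 8 input.dat part_
    for p in part_*; do ./process_part "$p" & done
    wait
else
    # small enough: a completely different command, entirely in memory
    ./process_in_memory input.dat
fi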
In my experience the problems in big data analysis are more about the complexity of managing the process, achieving as much parallelization with as little effort and craziness as possible (I don't see any mention of that in Drake), documenting what actually happened when something ran so you can figure it out later, and most of all, flexibility in modifying it, since it changes every day of the week.
One mistake that Drake appears to make (again, from my quick skim), is interweaving the declaration of the "stages" of the pipeline (what they do) and the dependencies between them (the order they run in). This makes your pipeline stages less reusable and the pipeline harder to maintain. Bpipe completely separates these things out, which is something I like about it.
Thanks for your feedback. We do mention parallelization in the design doc, it's just not implemented yet. It's quite easy to add, though. We have a lot of features spec'ed out, but not implemented.
I would appreciate if you elaborated on separating step definitions from dependency definitions. In my mind, they are the same thing. If you mean that steps might not be connected by input-output relationship, but still have dependencies, Drake fully supports that via tags. If you mean that steps might be connected through input-output files, but not depend upon each other, I don't frankly see how it's possible. And if you mean some other syntax which more clearly separates the two, Drake supports methods which achieve exactly that. If you mean something else, I would love to see an example.
> I would appreciate if you elaborated on separating step definitions from dependency definitions
As I said, I only very quickly skimmed since I'm busy, I might have overlooked information, and apologies in that case. But take the example from the front page:
So now suppose a new requirement comes along - Evergreen is also called "Neverbrown" sometimes. It's decided the best way is to convert all references at input so nothing else gets confused downstream. So now I need an extra step - something like this (the filenames and commands here are illustrative):
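; schematic - filenames and commands are illustrative
contracts.fixed.csv <- contracts.csv
  sed 's/Neverbrown/Evergreen/g' $INPUT > $OUTPUT

; the original step now has to be edited to point at the new file
evergreens.csv <- contracts.fixed.csv
  grep Evergreen $INPUT > $OUTPUT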
Adding this step forced me to modify the declaration of the original command, even though what I added had nothing to do with that command. With Bpipe, for example, you say (the stage body below is illustrative):
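fix_names = {
    // the stage body here is illustrative; the composition below is the point
    exec "sed 's/Neverbrown/Evergreen/g' $input > $output"
}

run { fix_names + extract_evergreens }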
If I get contracts from a different source that don't need the renaming, I can still run my old version and I'm not changing the definition of anything:
run { extract_evergreens }
Hope this explains what I mean, and again apologies if this is all clearly explained in your docs and I just jumped to conclusions from the simple examples!
I see. Thank you very much. I think this is very cool. I can see several problems with this approach, though, and I would greatly appreciate it if you could comment on them. After all, I don't know Bpipe.
The fundamental issue is why you have to repeat the filename, and I did give it some thought.
1. What your example does is allow you to allocate dependencies based on position. It's pretty cool. This seems to be easily reproducible in Drake, if we add a special symbol that would just mean "a temporary file" for the output, and "the last temporary output" for the input (by the way, you don't need colons; the symbol below is just a placeholder):
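; hypothetical syntax: "!" is just a placeholder meaning "a temporary file"
; as an output and "the last temporary output" as an input
! <- contracts.csv
  sed 's/Neverbrown/Evergreen/g' $INPUT > $OUTPUT

evergreens.csv <- !
  grep Evergreen $INPUT > $OUTPUT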
2. One of the problems, as you can see, is that it only works if you don't care about the filenames, i.e. you use a temporary file. Similarly, your Bpipe expression:
run { fix_names + extract_evergreens }
doesn't care about filenames either. How do you add it there? What if you need this file for debugging purposes, or if it's an input to some further step down the road? In that case, you'd have to do what you want to avoid doing (i.e. modify the original step).
3. I'm even more concerned with multiple inputs and multiple outputs. As long as your workflow is simple, you can get away with a + b. But when it's more complicated, you would have to do something like (the stage names below are made up):
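// stage names are made up, and "*" is my invented operator, not actual Bpipe syntax
run { fix_names + (extract_evergreens * extract_contacts) + merge_results + report }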
(I used * as an operator that puts two outputs together to create a two-file input for the next command. Mathematically, + would be better for that, and * for what + is used for in your examples. :))
As you can see, it gets unreadable so fast that you'd want to use some sort of identifiers to specify dependencies, and you'd end up with a scheme pretty much equivalent to filenames. The fact that some file might be temporary is a related but separate problem.
4. Now even worse, I'm not quite sure how this syntax could accommodate multiple outputs. If fix_names creates several outputs, and extract_evergreens uses only one, you can't get around it without some weird syntax and specifying a numeric position. It also gets out of hand pretty quickly and you're back to using some sort of identifiers, be it filenames or not.
5. Speaking of identifiers, you can use variables in Drake instead of filenames, so you can abstract filenames away. But it seems to me there's a more fundamental problem in play.
6. If you're concerned with coupling implementation and input and output names, Drake has methods for this:
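Schematically (exact method syntax as per the docs; filenames are illustrative):

; the command is defined once as a method...
fix_names()
  sed 's/Neverbrown/Evergreen/g' $INPUT > $OUTPUT

; ...and bound to concrete files separately
contracts.fixed.csv <- contracts.csv [method:fix_names]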
To summarize, I think your example is cool, but it seems to only be practical for rather simple workflows. And I can also see how Drake can easily be extended to support such syntactic sugar. For more complicated dependencies though, I don't really see a better approach.
I would love to hear your further thoughts on the matter, and whether you'd like to see something similar to what I proposed in Drake. Or something else.
Sorry for the late reply - I was really busy yesterday and didn't have time to do it justice.
> One of the problems, as you can see, is that it only works if you don't care about the filenames
This is a really insightful point - it touches on one of the ways Bpipe differs philosophically from other tools. Bpipe absolutely says you don't want to manage the file names. Not that you don't care about them, but it takes the position that naming the files is a problem it should help you with, not a problem you should be helping it with. It enforces a systematic naming convention for files, so that every file is named automatically according to the pipeline stages it passed through. So, for example, after coming through the 'fix_names' stage, 'input.csv' will be called 'input.fix_names.csv'. It does sometimes give you names that aren't correct by default, but it gives you easy ways to "hint" at how to produce the right name. E.g., if we want the output to end with ".txt" we write something like:
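exec "extract_text $input > $output.txt"    // the command is made up; the point is $output.txt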
Similarly if there are a lot of inputs and you need the one ending with ".txt" you will write "$input.txt", if you want the second input ending with ".txt" you will write "$input2.txt", and so on. Part of this stems from the huge number of files that you can end up dealing with. When you start having hundreds or thousands of outputs naming them quickly goes from being something you want to do to a chore that drives you completely crazy and you want a tool to help you with. Bpipe's names definitively tell you all the processing that was done on a file which is extremely helpful for auditability as well.
> I'm even more concerned with multiple inputs and multiple outputs
As I touch on above, it's really not too hard. Bpipe gives you ways to query for inputs in a flexible manner to get the ones you want. The commands you write imply what files you need, and Bpipe searches backwards through the pipeline to find the most recent output files that satisfy those needs. Multiple outputs are similar ...
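For example (the stage and command names are made up):

split_contracts = {
    // outputs are just referenced by number
    exec "split_by_region $input.csv $output1.csv $output2.csv"
}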
If you need to reach further back in the pipeline to find inputs there are more advanced ways to do it, but this works for 80% of your cases (the whole idea of a pipeline is that each stage usually processes the outputs from the previous one - so this is what Bpipe is optimized to give you by default).
> I think your example is cool, but it seems to only be practical for rather simple workflows. And I can also see how Drake can easily be extended to support such syntactic sugar.
It depends what you mean by "simple". I use it for fairly complicated things - 20 - 30 stages joined together with 3 or 4 levels of nested parallelism. It seems to work OK. I'd argue that it's more than syntactic sugar, though - it's a different philosophy about what problems are important and what the tool should be helping you with.
Another problem with BPipe's approach is that if you change a method's name, you invalidate the existing files. This can be a problem during development, when re-running steps is expensive.
I wouldn't argue with that. One could also say it's a good thing though ... if you're modifying the method enough that you need to rename it, those original files might not be valid any more, so it's a good thing if the tool wants to recreate them.
I think that would be a stretch to say - in my opinion, it's not a good thing, for the sole reason that it doesn't let you opt out of it. And it's not one of those cases where you would want to enforce discipline, because renaming a method has nothing to do with its contents.
Actually, I don't think there are any philosophical differences, and I'll try to make my case.
> Bpipe absolutely says you don't want to manage the file names.
I think this is too strong a statement as I try to show below.
> So, for example, after coming through the 'fix_names' stage, 'input.csv' will be called 'input.fix_names.csv'.
fix_names is the identifier in this case. There's really not much of a difference whether you use identifiers to come up with filenames, or filenames to come up with identifiers. If anything, I think filenames are preferable, because the user doesn't have to be aware of the scheme the tool uses to convert identifiers to filenames. The fact that identifiers are just a little bit shorter (e.g. don't have a .txt extension or something) does not outweigh the inconvenience of figuring out where the files are. The problem with this approach is that figuring out where the files are requires knowledge of the tool's inner workings, which can only be acquired by reading the code or documentation.
Another problem with these naming conventions is that if you use the same code in multiple steps, things can become quite confusing. How will BPipe name them? Or is the only way to handle it to copy-and-paste the code and create another rule?
It seems like an insufficiently clear separation between the code and the filenames can be a source of problems... Please correct me if I'm wrong.
I strongly prefer the first option, because there are fewer implicit things going on, and the code is more clearly separated from the file naming. Besides, it's even shorter.
> Similarly if there are a lot of inputs and you need the one ending with ".txt" you will write "$input.txt", if you want the second input ending with ".txt" you will write "$input2.txt", and so on.
This can work for very simple workflows with maybe several cases of multiple inputs and outputs, but it's unmanageable when complexity grows.
Imagine a step which takes 3 inputs - one separate, one which is output #2 of a previous step, and one which is output #6 of yet another step. You can't use numbers to resolve that. You will end up coming up with some sort of semantic identifiers, which will almost completely replace BPipe's naming convention. And what's worse, they will be hard-coded in your step's commands, which means you'll have to edit the code if you want to change the filenames, or re-use this step's implementation somewhere else.
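In Drake you would simply name the three files - something like this (the filenames are made up):

; made-up filenames, one per input
report.csv <- lookup.csv, cleaned.part2.csv, merged.part6.csv
  build_report $INPUT0 $INPUT1 $INPUT2 > $OUTPUT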
> When you start having hundreds or thousands of outputs naming them quickly goes from being something you want to do to a chore that drives you completely crazy and you want a tool to help you with.
I'm not sure I agree here. Here's how I see it:
Instead of naming hundreds of files, you have to name hundreds of methods (commands). Yes, you don't have to repeat the filenames to create dependencies, but you have to repeat the method names (in "contracts + evergreens"), and in a way which quickly becomes unreadable.
This doesn't work for complicated workflows, and for simple ones, I would prefer positional linking rather than coming up with names, like in the example I provided above.
There's nothing that prevents Drake from coming up with filenames from more abstract identifiers. We could come up with some syntax where you'd just give an identifier (say, "~contracts"), and we'll take care of the file location and name, just like BPipe does. The major difference is not this. The major difference is that we think you need to identify inputs and outputs to build the graph, and the method name is insignificant until you want code re-use, and BPipe seems to take the opposite position - that you need to give method names, and then use a separate expression to build the graph.
I think I provided at least a few strong arguments why BPipe is wrong on this one. I would really love to hear your further thoughts.
> As I touch on above, it's really not too hard. Bpipe gives you ways to query for inputs in a flexible manner to get the ones you want.
I'm sorry, I understood neither this nor the example you provided. Could you please elaborate? In the example you provided you identify different outputs by adding a number to their names. Is that how subsequent steps are supposed to refer to them as inputs - by the positional output number from the step that used to generate them?
> I'd argue that it's more than syntactic sugar, though - it's a different philosophy about what problems are important and what the tool should be helping you with.
I appreciate your opinion. But the way I see it is this:
1) As far as different philosophies go, I find BPipe's one to be a bit problematic for complicated cases.
2) And for simple cases, it all comes down to syntactic sugar.
I understand it's hard to argue in the abstract, so I'll tell you what. Give me an example of a BPipe workflow that you particularly like, and I'll put it in Drake. I might need to invent some Drake features on the fly, but that's a good thing. This is what these discussions are for. I'll try to show you that there's no philosophical difference, and Drake has a more flexible approach overall. I am looking forward to this challenge, because your opinion is important to me.
Hey, just want to say thanks for the great discussion again. I'm a bit humbled at the length & depth of thought you're putting into it.
> The problem with this approach is that figuring out where the files are requires knowledge of the tool's inner workings, which can only be acquired by reading the code or documentation
I suppose this is true but it's really not an issue I have in practice. I run the pipeline and it produces (let's say) a .csv file as a result. I execute
ls -lt *.csv
And I see my result at the top. There's really not a huge inconvenience in trying to find the output. Having the pipeline tool automatically name everything instead of me having to specify it is definitely a win in my case. I suspect we're using these tools in very different contexts and that's why we feel differently about this. It sounds like you need the output to be well defined (probably because there's some other automated process that then takes the files?) You can specify the output file exactly with Bpipe, it's just not something you generally want to do. There's nothing wrong with either one - right tool for the job always wins!
> if you use the same code in multiple steps, things can become quite confusing. How will BPipe name them
It just keeps appending the identifiers:
run { fix_names + fix_names + fix_names }
will produce input.fix_names.fix_names.fix_names.csv. So there's no problem with file names stepping on each other, and it'll even be clear from the name that the file got processed 3 times. One problem is you do end up with huge file names - by the time it gets through 10 stages it's not uncommon to have gigantic 200 character file names. But after getting used to that I actually like the explicitness of it.
> Imagine a step which takes 3 inputs - one separate, one which is output #2 of a previous step, and one which is output #6 of yet another step
Absolutely - you can get situations like this. We're sort of into the 20% of cases that need more advanced syntax (eventually we'll explore all of Bpipe's functions this way :-) ). But basically Bpipe gives you a query language that lets you "glob" the results of the pipeline output tree (not the files in the directory) to find input files. So to get files from specific stages you could write something like:
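// schematic - the stage, the glob pattern and the command are illustrative
summarize = {
    from("*.fix_names.csv") {
        exec "summarize_names $input > $output"
    }
}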
It doesn't solve everything, but I guess the idea is, make it work right for the majority of cases ("sensible defaults") and then offer ways to deal with harder cases ("make simple things easy, hard things possible"). And when you really get in trouble it's actually groovy code so you can write any programmatic logic you like to find and figure out the inputs if you really need to.
> Instead of naming hundreds of files, you have to name hundreds of methods (commands)
Not at all - if my pipeline has 15 stages then I have 15 commands to name. Those 15 stages might easily create hundreds of outputs though.
> The major difference is that we think you need to identify inputs and outputs to build the graph, and the method name is insignificant until you want code re-use, and BPipe seems to take the opposite position - that you need to give method names, and then use a separate expression to build the graph
Again, a really insightful comment, but I'd take it further (and this goes back to my very first comment). Bpipe isn't just not trying to build a graph up front, it really doesn't think there is a graph at all! At least, not an interesting one. The "graph" is a runtime product of the pipeline's execution. We don't actually know the graph until the pipeline finished. An individual pipeline stage can use if / then logic at runtime to decide whether to use a certain input or a different input and that will change the dependency graph. You have to go back and ask why you care about having the graph up front in the first place, and in fact it turns out you can get nearly everything you want without it. By not having the graph you lose some ability to do static analysis on the pipeline, but to have it you are giving up dynamic flexibility. So that's a tradeoff Bpipe makes (and there are downsides, it's just in the context where Bpipe shines the tradeoff is worth it).
> In the example you provided you identify different outputs by adding a number to their names. Is that how subsequent steps are supposed to refer to them as inputs - by the positional output number from the step that used to generate them
I think the "from" example above probably illustrates it. The simplest method is positional, but it doesn't have to be, you can filter with glob style matching to get inputs as well so if you need to pick out one then you just do so.
> 1) As far as different philosophies go, I find BPipe's one to be a bit problematic for complicated cases.
I can't argue with that - but that's sort of the idea: simple things easy, hard things possible. Complicated cases are complicated with every tool. I guess I would say that pipeline tools live at a level of abstraction where they aren't meant to get that complicated.
> 2) And for simple cases, it all comes down to syntactic sugar.
I guess I'd have to disagree with this, as I really think there are some fundamental differences in approach that go well beyond syntactic sugar.
> Give me an example of a BPipe workflow that you particularly like, and I'll put it in Drake
I wouldn't mind doing that - I'll need to look around and find an example I can share that would make sense (what I do is very domain specific - unless you have familiarity with bioinformatics it will probably be very hard to understand). I'll pm you when I manage to do this, but it may take me a little while (apologies).
Thanks as always for the interesting discussion. I think this is a fascinating space, not least because there have been so many attempts at it - I would say there are probably dozens of tools like this going back over 20 years or so - and it seems like nobody has ever nailed it. Bpipe has problems, but so does every tool I've ever tried (I'm probably up to my 8th one or so now!).
> It doesn't solve everything, but I guess the idea is, make it work right for the majority of cases ("sensible defaults") and then offer ways to deal with harder cases ("make simple things easy, hard things possible").
My contention is that while BPipe makes simple things easy and hard things possible, Drake makes both easy and possible. I think I've made some points to that effect, and given you examples of Drake code which is just as easy to write as the corresponding BPipe code without compromising on functionality. But to really conclusively prove this, I'm looking forward to more BPipe examples. So far, I haven't seen anything that is simpler (or even shorter) in Bpipe.
> Not at all - if my pipeline has 15 stages then I have 15 commands to name. Those 15 stages might easily create hundreds of outputs though.
When I first read it I thought this is a great point and you're onto something. But as I thought about it more, I realized that it only seems this way.
Here's the thing: if you have 15 stages but hundreds of files, it can only mean one of two things:
1) The vast majority of those files are leaf files, that is, they are either inputs (with pre-determined names) or outputs whose names you don't really care about (surprisingly). Drake can generate filenames for leaf output files with ease, as they don't affect the dependency graph.
2) The vast majority of those files are not leaves, which means that the steps either:
2a) pass dozens of inputs and outputs to each other, and you have to either give them identifiers (as described above, Drake can do it too) or use positions (unmanageable); or
2b) even worse, form a big and complicated dependency graph with many more than 14 edges, in which case your syntax of { a + b + c } will almost certainly be inadequate to describe such a complex thing (15 vertices and several dozen edges).
So, any way you look at it, Drake can do the same thing in the same way or better. Am I missing something?
> Bpipe isn't just not trying to build a graph up front, it really doesn't think there is a graph at all! At least, not an interesting one. The "graph" is a runtime product of the pipeline's execution.
I don't understand this. I'm afraid it doesn't work that way. You can't have the graph as a runtime product of the execution (i.e. after the execution), because it cripples your ability to do partial evaluation of targets. That is, you have to have the dependency graph before you can even answer the question - "is target A up-to-date?". If you need to run the workflow to arrive at the conclusion, there's no guarantee how much time it will take. I also believe it unnecessarily blurs the distinction between the commands and the workflow. If your code needs to care about its dependencies, it can't be used out of context. So, maybe an example?
But if all you need to do is re-run everything every time, then it means you're really doing something trivial, and it also raises the question of why we need a tool like BPipe in the first place.
> An individual pipeline stage can use if / then logic at runtime to decide whether to use a certain input or a different input and that will change the dependency graph.
I don't see how it could work this way. Could you please give me an example along with the explanation of how BPipe will handle it on the control level?
> You have to go back and ask why you care about having the graph up front in the first place, and in fact it turns out you can get nearly everything you want without it.
I'm confused, I think nothing could be further from the truth. The dependency graph specifies what steps depend on what steps. If you don't know it, you don't even know how to start evaluating the workflow, because you don't know which step to build first. I don't understand this statement at all. Could you please elaborate or give me an example?
> By not having the graph you lose some ability to do static analysis on the pipeline, but to have it you are giving up dynamic flexibility.
I need to see an example of this.
> I can't argue with that - but that's sort of the idea: simple things easy, hard things possible. Complicated cases are complicated with every tool.
I don't think having 3 inputs is a very complicated case. And neither is having any dependency graph which is not a linear step1, step2, step3. My point is as soon as you get any of those, BPipe starts to slowly evolve into Drake, with some very weird syntax and inconsistencies (like having "implicit" dependencies in steps' implementations but having to also specify some or all of the dependencies in the "run" statement).
It's possible that I'm misunderstanding BPipe. Maybe some more examples would fix this.
> I guess I'd have to disagree with this, as I really think there are some fundamental differences in approach that go well beyond syntactic sugar.
I don't really see them. And you can't just disagree, you have to provide arguments. :) I understand you can see it differently, but it seems like so far, there could be a Drake workflow for every BPipe example, which uses the same ideas and is equally easy to write (but not necessarily the reverse). This means it all comes down to syntax, no?
Again, I might be misunderstanding BPipe.
I think it's really, really hard to argue abstract concepts. I would very much appreciate some examples. It doesn't even have to be your favorite workflow. Just give me anything. Write something and ask - "how would you put it in Drake?". I think my response would make it clear whether there are syntactic or philosophical differences. We've already established that there are some things BPipe cannot do as well as Drake can. I'd like to see whether the reverse is true. Because in that case we could really identify philosophical differences; but if it's the opposite - i.e., Drake can do everything BPipe can with the same ease - then it's not a question of philosophy any more but of design.
I'm not trying to attack BPipe. I just want to make the best tool possible, and if we make compromises, I want to make sure they are informed. We must consciously choose some things not to be as easy or possible in Drake for some other greater good. So far, I can't identify any of those things.
Show me. :)
Artem.
P.S. You don't have to give a real world example. I think that would actually unnecessarily constrain and slow you down. Just demonstrate a basic concept, a feature, name your steps A, B, C - I don't care what they do. Only if it's something extremely exotic might I ask if there's a real world use-case for it, but I think I can come up with use-cases for pretty much anything. :)
P.P.S. Please include what you do to run the workflow in your examples. I suspect I might have misconceptions about what the "run" statement does and how Bpipe resolves dependencies.
P.P.P.S. I appreciate the dialog as well. Especially since Bpipe is your 8th tool. I would like Drake to be your 9th, and better than anything you used before, including Bpipe.
I'm sorry I don't have time to answer in full. I'm just going to respond to this one point because I think it's pretty fundamental and perhaps explaining it will clear up other things!
> The dependency graph specifies what steps depend on what steps. If you don't know it, you don't even know how to start evaluating the workflow, because you don't know which step to build first. I don't understand this statement at all. Could you please elaborate or give me an example?
I can see this is really really hard to grok if you're basing everything on the idea of a DAG, and so many tools are that it's very natural to think you couldn't do it any other way. Think of it as imperative vs declarative if you like. In Bpipe the user declares the pipeline order explicitly (as you've seen) - so that's the first part of the answer to your question. Bpipe knows which part to execute first because the user said to explicitly. But this isn't used for figuring out dependencies - dependencies arise as actual commands are executed. Back to our famous example:
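Schematically (the exact command doesn't matter):

// illustrative command - any renaming command would do
fix_names = {
    exec "sed 's/Neverbrown/Evergreen/g' $input.csv > $output.csv"
}

run { fix_names }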
If you run it once, Bpipe builds input.fix_names.csv. If you run it twice, Bpipe is clever enough not to build input.fix_names.csv again! How is that if it doesn't know about the dependency graph?! Well, it does it "just in time". It executes the "fix_names" pipeline stage (or "method") and that calls the "exec" command. The "exec" command sees that all the inputs referenced ($input variables) are older than the outputs referenced ($output variables). So it knows it doesn't have to rebuild those outputs, and skips executing the command. So what about transitive dependencies? If C depends on B, which depends on A (so dependencies are A => B => C), what happens if you delete file B? Technically you don't need to build C because it's still newer than A, but Bpipe can't see it any more. Well, Bpipe knows this too because it keeps a detailed manifest of all the files created. So when the call to create B is executed it can see that although B was deleted, it did exist and in its last known state was newer than its input files, so there's no need to rebuild it, as long as downstream dependencies are OK.
So in this way Bpipe handles dependencies for you. What it does not do is figure out which order to execute things in. It does them in exactly the order you tell it. This is one of those things that conventional tools solve which isn't actually that important (in my uses) but which occasionally is very annoying - I actually want to control the order of things sometimes. I want to be able to tell it "do this first, then that, then the next thing" regardless of dependencies. Usually it's pretty obvious what the right order things should be in and there are other externalities that influence how I like to do it ("I know this part uses a lot of i/o so try to do it in parallel with another bit that's mainly using CPU", or "Let's run this part last because it will be after hours and the other jobs will have finished"). Having the tool think this stuff up by itself can save you a bit of time but it can lose you a lot because you don't have the ability to really control what's going on.
We're not getting anywhere. Just give me goddamn examples! :) Please! Examples!
> I can see this is really really hard to grok if you're basing everything on the idea of a DAG, and so many tools are that it's very natural to think you couldn't do it any other way.
There is no other way. BPipe is based on the idea of a DAG. You just don't see it.
> In Bpipe the user declares the pipeline order explicitly.
And this is a big mistake. The reason is simple - explicit order is very hard to manage once you have multiple inputs and outputs, and as a consequence, complicated (instead of linear) dependency relationships.
What you don't seem to realize, is that by "declaring the pipeline order explicitly" you create a dependency graph. It's a part of your workflow definition. Your workflow contains the full definition of the dependency graph. Even if it didn't, you would still use it. There is no other way.
This is what I meant when I said - you create your dependency graph in "run". And this is a bad idea.
> dependencies arise as actual commands are executed.
What does it mean exactly? That the first command will somehow tell Bpipe what to run next? If not, then I don't understand this statement at all.
> How is that if it doesn't know about the dependency graph?! Well, it does it "just in time".
It does not matter if you calculate the dependency graph before you run the first command, or as you run the commands. It makes absolutely no difference. The only difference is whether it is computable or not. If you say it's not computable until run-time, please elaborate on that.
> So in this way Bpipe handles dependencies for you.
So far I see that this is very standard and doesn't differ in any way from what Drake or any other tool does. The only thing that differs, and I am repeating myself, is how you define your dependency graph - through input and outputs, or in "run". So far it seems that "run" is quite unfortunate. But please give me examples.
> So in this way Bpipe handles dependencies for you. What it does not do is figure out which order to execute things in. It does them in exactly the order you tell it.
This is a meaningless statement. Drake also executes steps in the order you tell it. The only difference is how you tell it. In Drake, you tell it by specifying, for each step individually, a list of steps it depends on (once again, it doesn't matter that filenames are used for that - Drake also supports tags, or it could be some other identifiers). In Bpipe, you tell it in "run", collectively and sequentially. Drake's way supports the whole variety of graphs, while Bpipe's way supports only a very limited subset. And for this limited subset, Drake can give you (I think) a syntax just as good as Bpipe's, if not better. If you don't quite understand what I'm talking about, give me an example, and I will demonstrate.
> I actually want to control the order of things sometimes.
This is fine, the only question is how. You say Bpipe's way is convenient. I say give me an example and I'll show you that Drake's way is not any less convenient. I'm sorry to keep repeating myself, I thought I stressed the importance of examples quite a bit in my previous email and I want to stress it again. Examples, please!
> I want to be able to tell it "do this first, then that, then the next thing" regardless of dependencies.
This statement is self-contradictory. You don't seem to realize that by telling it "do this first, then that" you are defining dependencies. It's fine, and it's OK, and it can be convenient, but you can't say it's regardless of them.
Again - give me examples! Our conversation is becoming useless without examples.
You did not give me an example, but I'll just grab whatever you threw my way. Here's your fix_names example in Drake (the filenames are mine, just for illustration):
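; the same fix_names workflow, with explicit (illustrative) filenames
contracts.fixed.csv <- contracts.csv
  sed 's/Neverbrown/Evergreen/g' $INPUT > $OUTPUT

evergreens.csv <- contracts.fixed.csv
  grep Evergreen $INPUT > $OUTPUT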
Isn't that much nicer? What disadvantages can you see?
Tell me what it is that you would like to do with this script, and I'll tell you a better way to do it in Drake. Is it multiple versions of run that you want to have? Easy. Are you concerned about inserting a step in the middle? Trivial. Tell me why Drake's code is worse, and I'll listen. So far it seems like it's better, because it's shorter and more flexible at the same time.
> Having the tool think this stuff up by itself can save you a bit of time but it can lose you a lot because you don't have the ability to really control what's going on.
What exactly are you losing?
I am sorry if I sound irritated. I am. I've just been begging for examples, and you keep talking in the abstract, and that would be fine, but you're making a lot of mistakes. So, instead of looking at concrete things that would make my point apparent to you (or the opposite, prove that I'm wrong), I keep pointing to flaws in your reasoning, which, frankly, is irrelevant. One picture is worth a thousand words.
I really want your feedback. But please give me examples.
> There is no other way. BPipe is based on the idea of a DAG. You just don't see it.
So if you think Bpipe uses a DAG, then I wonder how you would think it deals with:
run { fix_names + fix_names + fix_names }
In terms of the pipeline stages that run this is cyclic, so it cannot be a DAG. On the other hand the files created do usually form a DAG dependency relationship, but even there, in the most general case, it's not at all impossible in an imperative pipeline to read a file in and write the same file out again in modified form (or more likely, to modify it in place), so the file depends on itself - another non-DAG relationship. I'm sure you'll object to this in a purist sense, and tell me it is a horribly broken idea, but as a practising bioinformatician, when I have a 10TB file and modifying it in place will save me hours and huge amounts of space, I'm much more interested in getting my job done than being pure about things.
I think you're right that we're at diminishing returns here, and I'm sorry I've frustrated you. We're trying to bite off more than we can chew in a forum like this.
I wish you all the best with Drake and I'll definitely check it out down the track (when it supports parallelism, since that's too important to me right now). For now, though, I don't intend to read / respond to any more replies in this thread.
This is not a cyclic dependency graph!!! This is a syntax for copying vertices, nothing else. It creates a DAG of three vertices and two edges, but uses only one step definition to do so. It automatically replicates the step definition as needed. It would be extremely easy to reproduce in Drake - schematically:
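; schematic - one method, applied three times to chained files
; (filenames are illustrative; exact method syntax as per the docs)
fix_names()
  sed 's/Neverbrown/Evergreen/g' $INPUT > $OUTPUT

fixed1.csv <- input.csv [method:fix_names]
fixed2.csv <- fixed1.csv [method:fix_names]
fixed3.csv <- fixed2.csv [method:fix_names]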
Is there any difference between Bpipe's version and Drake's version that I am failing to see?
> I'm much more interested in getting my job done than being pure about things.
It's funny coming from someone whom I have been BEGGING for examples, only to get abstract philosophical reasoning in return.
I repeat. Give me an example. So far you haven't given me one example of what Bpipe can do that Drake couldn't do in the same way or better, and yet you continue claiming philosophical differences.
If we concentrate on examples and discuss how they would work, whether there are differences, and what these differences are, I guarantee you, we'll make progress. But then again, I'm repeating myself.
>> The problem with this approach is that figuring out where the files are requires knowledge of the tool's inner workings, which can only be acquired by reading the code or documentation
> I suppose this is true but it's really not an issue I have in practice. I run the pipeline and it produces (let's say) a .csv file as a result.
It's a good point and, I guess, I didn't mean it's a major issue. Just something which is, I believe, less than ideal design, because it spreads (de-centralizes) information. For example, if you want to do something with the files outside of your workflow, in some shell script, that shell script would contain a filename which a reader would have no idea how you came up with. Again, it's not something to obsess over, just an observation.
> There's really not a huge inconvenience in trying to find the output.
Even if so (highly doubtful in the case of, as you say, 200 character filenames), this reasoning only applies to interactive sessions.
> Having the pipeline tool automatically name everything instead of me having to specify it is definitely a win in my case.
It's only true if you have to type less. I'm trying to make a case that you don't have to sacrifice clarity to achieve the same result. I'm trying to show you can win without losing.
> I suspect we're using these tools in very different contexts and that's why we feel differently about this.
That might be true, but we were also trying to come up with a universal tool. That is, we are willing to make sacrifices if not making them means severely limiting the scope of usage. But again, I am trying to show you don't even have to make sacrifices.
> It sounds like you need the output to be well defined (probably because there's some other automated process that then takes the files?)
Sometimes, yes; sometimes only for debugging; sometimes only for convenience. But more importantly, I'm arguing that using filenames is just a better way to build the dependency graph, regardless of whether you write the filenames yourself or use some identifiers that result in automatic filename generation. Remember I said Drake could easily do that? The core issue here is not filenames. It's what is the better (easier to read, less to type, easier to understand) way to define the dependency graph.
> You can specify the output file exactly with Bpipe, it's just not something you generally want to do.
Again, that's not the point. If you start specifying filenames exactly with Bpipe (I'm assuming you mean in the commands themselves), you would just end up with a very strange beast: you'd essentially have to define the dependency graph twice, once indirectly and once directly. Or at least define different dependencies in different ways. It seems like this would just be a total mess. But I'm trying to show that even if you want to not care about filenames, Drake's approach is better.
> There's nothing wrong with either one - right tool for the job always wins!
My feeling so far was that it's not like a comparison of C and Python, but rather like a comparison of C and C++. There's absolutely nothing that you can't do in C++ better or at least as well as in C. Of course, I might be wrong, and that's why I would love to see an example workflow which I would then put in Drake and we'll be able to objectively compare.
> It just keeps appending the identifiers: [...] will produce input.fix_names.fix_names.fix_names.csv. So there's no problem with file names stepping on each other, and it'll even be clear from the name that the file got processed 3 times.
First, I don't want to process the file 3 times - I didn't mean calling the same method 3 times, I meant using the same code in different parts of the workflow. For example, you have a method to convert data from CSV to JSON, and you use it a dozen times all over the workflow.
Secondly, I think this is pretty bad. The way you described it, it makes filenames situational - i.e. dependent on what part of the workflow they're in. Removing one fix_names from the chain could invalidate other fix_names steps' inputs and outputs, or worse - not invalidate the timestamps, but make such a huge mess that the user won't even know what hit him. Editing the workflow should not require such careful consideration of the tool's inner workings. And if you can afford to re-run the whole thing every time you add or delete a step, you're working on something very, very simple.
> One problem is you do end up with huge file names - by the time it gets through 10 stages it's not uncommon to have gigantic 200 character file names.
I apologize - I didn't even realize the filenames carry all their creation history; I thought that was only the case with repeated names. I don't want to be harsh, but I think it's beyond bad. It means any change to the workflow can invalidate everything. This makes BPipe unusable for anything even remotely expensive. Please correct me if I'm wrong.
I'm sorry, I tried but I didn't understand this code. Could you please elaborate? What do you mean by "glob"? The way I see it, you may glob all you want, but there are just two ways to resolve this: use positional numbers or use some sort of identifiers. If you use positional numbers, it becomes unmanageable. And if you use identifiers, we're back where we started. It doesn't matter if they're filenames or not; what matters is that once you've started using identifiers, you can generate the dependency graph yourself, from identifiers. In other words, you've arrived at Drake's model.
Thank you very much. We're really looking forward to other people using this tool.
You raise some interesting points (for example, frequently changing code), which we ran into as well. Our current approach to it is not as fundamental, and basically includes the ability to force a rebuild of any target and everything down the tree, as well as methods; you can also add your binaries as a step's dependency.
I'm sure as we and other people use the tool, we'll have better ideas. For example, Drake could automatically sense that the step's definition has changed and offer to rebuild or dismiss.
Other points you raised are also definitely worth thinking about.
I'm also a developer of a workflow processing system, though not open-source, and fairly specific to our company. A few more things that are desirable if you have a lot of data or need to do processing that takes a lot of time is the ability to run stages in parallel, and also to distribute the computation over a cluster of machines.
Drake supports the ability to run stages in parallel (at least in theory) - it's been spec'ed out (https://docs.google.com/document/d/1bF-OKNLIG10v_lMes_m4yyaJ...), just not implemented yet. But of course, once you have the entire dependency graph, it's easy to know what can be run in parallel and what cannot.
As for distributing computations, our approach is that it lies outside of Drake's scope. Drake doesn't know what's going on inside steps. But you can always implement a step that would use distributed computation, for example, by submitting a Hadoop job, or in any other way. The only requirement Drake has is for the step to be synchronous, i.e. not to return before all the computation is complete. But even that can be changed for some cases.
I really wish that I had a tool like this back in grad school. I was doing bioinformatics work and merging, chopping, and processing various datasets over many months. When a new version of the underlying data came out it was not an easy task to go back and re-process it through dozens of steps in Perl and R. Having a tool like this would have made it a single command to do so and also ensured repeatability and transparency in my data, something which is often sorely lacking in an academic setting.
I am one of the data engineers at Factual and though I didn't have a role in creating it I definitely enjoy using it on a day to day basis. You begin to see the utility of it when you have a dozen people working up and down a data pipeline and need to coordinate as product specs evolve or schemas change.
I also really like the tagging features - you can add specific tags to different steps in the build and run different "flavors" of your workflow depending upon what is needed. For example, you might build a workflow that collects, cleans, filters, and performs calculations on data from all over the world - but you might also want alternative versions of the build that only work on specific regions or smaller debug datasets. Tags make that really simple to do, even when many steps are shared by the different versions or the dependencies are complicated.
I've spent a lot of time working with pipelining software, first for my last job doing bioinformatics research, and now for handling analytics workflows at Custora. We ultimately decided to write our own (which we are considering open sourcing, email me if you are interested in learning more).
The initial system that I used was pretty similar to Paul Butler's technique, with a whole bunch of hacks to inform Make as to the status of various MySQL tables, and to allow jobs to be parallelized across the cluster.
At Custora, we needed a system specifically designed for running our various machine learning algorithms. We are always making improvements to our models, and we need to be able to do versioning to see how the improvements change our final predictions about customer behavior, and how these stack up to reality. So in addition to versioning code, and rerunning analysis when the code is out of date we also need to keep track of different major versions of the code, and figure out exactly what needs to be recomputed.
We did a survey of a number of different workflow management systems such as JUG, Taverna, and Kepler. We ended up finding a reasonable model in an old configuration management program called VESTA. We took the concepts from VESTA and wrote a system in Ruby and R to handle all of our workflow needs. The general concepts are pretty similar to Drake, but it is specialized for our Ruby and R modeling.
It looks like all of the drakefiles could be replaced pretty trivially with Makefiles. Replacing '<-' with ':', ';' with '#', and '$INPUT', '$OUTPUT' with '$<' and '$@', and inserting shell invocations of the Python interpreter looks like it would do the job.
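For a trivial step, the mapping is roughly this (the step itself is hypothetical):

; Drake:
out.csv <- in.csv
  ./process.py $INPUT > $OUTPUT

# Equivalent Makefile rule (the recipe line must start with a tab):
out.csv: in.csv
	./process.py $< > $@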
The major differences I see are:
- Inline support for Python et al.
- Confirming the steps that will be taken.
- HDFS support.
The example in the blogpost is understandably trivial, and it can be implemented in almost any Make-like system.
The concept of Make is not unique. Everything that has dependencies and executes steps is similar to Make in concept. Drake is no exception, and it can be replaced with Make, but no more so than Rake, Ant or Maven can be replaced by Make. That is, if it's trivial - yes. Just a bit more complicated - no.
Some things are merely painful to implement with Make, and some are just impossible (there's a small sketch of one of them after the list):
- multiple outputs
- no-input and no-output steps
- HDFS support
- Hadoop's partial files support (part-?????)
- forced execution of any subbranch, up or down the tree or any individual targets (crucial for debugging and development)
- target exclusions
- protocol abstraction - inline Python is just one example
- tags
- branching
- methods
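For example, a step with two outputs (the filenames and the script are made up):

; two outputs from a single step - something Make can't express cleanly
clean.csv, rejects.csv <- raw.csv
  ./clean.py $INPUT $OUTPUT0 $OUTPUT1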
These are just what's implemented already. Other things are planned such as:
- automated data versioning (backup and revert)
- parallelization
- real-time status console
- retries, email notifications
- etc.
Requirements for building executables and for working with large, complicated and expensive data workflows are quite visibly different, and the most important thing about Drake is that it provides the platform for convenient features (such as versioning or email notifications) to be implemented. And once they are, every data workflow can take advantage of them.
I guess, if Make was really, really extendable, we could have considered it as a platform for all this. But it's not, and hacking all of that into Make's source code in C would be, I'm sure, a much greater pain than writing Drake.
Retries and email notifications is a good one. Currently I do something similar with cron jobs, rsync, shell scripts and some custom tools - on multiple boxes. (Email notification with mailx.) It works pretty well in theory; in practice, race conditions become a problem, making it sometimes annoying because I need to run things manually when I need up-to-date processed data. If I had retries, this would be an improvement.
Got ya. Please voice your opinion about the priority in which features should be implemented by submitting a feature request at https://github.com/Factual/drake/issues, or +1'ing an existing one.
There are so many potential features to be added to Drake, and a lot of them have already been thought about and spec'ed out, that we need some sort of a way to figure out what to do first.
Of course, if you'd like to actively contribute, we'd be ecstatic.
Make can support Python, or any other language you'd like. Just set ONESHELL to avoid splitting commands by line, and then set SHELL to your preferred language interpreter. Make will then hand that interpreter the entire body of commands to rebuild a target.
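A sketch of that setup with GNU Make (the recipe itself is made up):

.ONESHELL:
SHELL = python3
.SHELLFLAGS = -c

report.txt: data.csv
	# the whole recipe below is handed to python3 -c as a single program
	with open("data.csv") as f:
	    rows = f.readlines()
	with open("report.txt", "w") as out:
	    out.write("%d rows\n" % len(rows))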
Drake supports "protocol" abstraction, which is much more than just specifying an interpreter. Python is a trivial protocol, not much more complicated than shell. There are slightly more complicated protocols, for example, "eval", which runs the first line as a shell command before putting everything else in $CMDS environment variable. There could be protocols for running an HBase query, a Pig query, Cascalog query, or an SQL query. Some of these things could involve building a JAR file and giving it to Hadoop binary. Currently only a handful of protocols is implemented, but more are described in the spec.
Make was a major inspiration for us, and so Drake definitely has similarities to Make. The differences you list were non-trivial to us in usefulness, but of course YMMV. Also, there are a lot of (possibly) interesting future features described in the spec.
Does it have to have big differences? It's a slightly nicer system with a fairly shallow initial learning curve. If you're on a new project, what's the problem? I'm wondering how well it would work as an actual make replacement.
With an empty workflow, this is the result of `drake --version`.
$ time drake --version
Drake Version 0.1.0
Target not found: ...
drake --version 5.42s user 0.18s system 188% cpu 2.969 total
For short scripts that you should be running in the shell, this is really bad. I expect basic make commands on small projects to be effectively instant. Compilation might take a bit longer, but 5.4s to print the version points to a 5s overhead on all executions.
I'm guessing this is due to the JVM overhead, so that pretty much says this project isn't suited to the JVM. The JVM is great for long running processes, and applications where the overhead is a very small percentage of the total running time, but if it takes 5s longer than `make` to print its version, that's really not a good sign.
This is a fantastic idea, and I will definitely be using it. But this overhead needs fixing.
First of all, --version shouldn't try to run any targets. This seems like a bug. Thanks.
Yes, you guessed correctly - this is the JVM startup time. I just hate the JVM for that. We experimented with Nailgun and Drip to eliminate it - Nailgun is problematic because it uses a shared JVM for all runs, and it can get quite hairy sometimes. In the long run, Nailgun is almost certainly not the answer, since it assumes that things we have no control over (i.e. the Clojure runtime) don't do destructive teardown. Drip is a bit more promising, but we didn't succeed in running Drake under it (simpler things worked fine though).
So, we're still looking into it, and we're looking for other ideas, too.
In the meantime, you could run Drake under REPL:
(-main "...")
The only problem is that Drake calls System/exit but we can add a flag ("--repl") that would prevent it from doing so, and you'll stay in REPL.
Thoughts?
P.S. JVM is unfortunate but Clojure is a fantastic language for something like Drake.
I have limited experience with Clojure, but it does seem to be a good match for this sort of task due to its structure. However, the JVM seems to be a real drawback to me. Perhaps with something like Scheme or Lisp you might get a similar program structure, and be able to compile to faster-starting binaries?
The REPL is a solution, but as many developers are using tools like make with many other tools in the shell, running a REPL like that would prevent them from using other things efficiently. Ultimately I think the overhead time needs to be removed.
If it takes far longer than something like make, that's not necessarily an issue. The key point is making it fast from the user's perspective. As long as it runs in a fraction of a second, I can't see much of a difference between 0.1s and 0.0001s, so I don't think that sort of difference really matters, it's when it gets over 1s that it becomes an issue.
Running something like Nailgun in the background may be a good solution, I don't have any experience with it. But if it requires starting a daemon in the background, that could get in the way of using the tool in a normal way.
I don't really know what the best solution to this problem is. I'm not sure Clojure is the best tool for the job.
I can certainly see your point about using Drake in an automated environment where this delay would still matter, but running a daemon is not practical. I think you have a lot of good arguments against the JVM. There were some moments when I thought it might not have been the best choice as well - for example, the Java world is notoriously poor at dealing with child processes.
So, I agree, but there are several arguments that it's not that bad after all:
- Drake is fundamentally an interactive tool. If you run it as a part of an automated process, all its flexibility is not quite needed. You could have Drake print a list of all shell commands it would execute, and save it to get your automated script.
- Most data workflows Drake is good for are quite expensive. Minutes, sometimes hours. Definitely much more than 5 seconds. The reason is simple - if your workflow takes so little time, you're really not gaining much by using a complicated tool like Drake, instead of just putting it all in a linear shell script, and simply re-running everything every time you need it.
- Maybe we'll find a good solution like Nailgun and Drip.
- Maybe someone will make a Java-code compiler that would create a stand-alone executable out of a JAR.
- Maybe Sun will eliminate JVM startup overhead. Or somebody will release a 3rd party JVM without it.
- Maybe we'll have a compiled version of Clojure one day.
- Other maybes. :)
We would certainly support any effort to port Drake to Lisp, C++, Ruby, Python or any language you desire. Porting it to Common Lisp might not be that much easier than porting it to Ruby. We might not consider it ourselves, since the effort would be quite substantial.
I would say that if a startup overhead of < 10 seconds bothers you, you're not working with "data". Of course sed and grep have less overhead, but I wouldn't even think of trying out a new tool for files/datasets larger than, say, a gigabyte. (Rough guess, I know you can use grep and sed in under 10 seconds for larger files; the point is about perspective and complexity.)
Clojure is sadly a really bad choice for fire-and-forget cli scripts, but "large scale data processing" doesn't fit this criterion for me.
I'm mostly going to use this for parsing XML into some other formats and getting it into SQLite databases I think. The reason I would like to use Drake over 'raw' Python scripts is because it supports a lot of the mundane stuff that goes around the actual processing of the data, and I want to automate the processes.
I typically deal with sub-100MB XML documents, so processing them takes very little time, but having the quick iteration of changing the format and re-outputting is a key part of the development cycle for me, and I think very useful when you are experimenting with new data and seeing how it could be used. Doing quick transforms is awesome.
Drip now works with Drake! Yes, it's still less than ideal if you're calling Drake hundreds of times from an automated script which you need to run quickly, but for interactive development, it should work just fine:
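The wiring amounts to swapping the JVM launcher in whatever wrapper script you use - a sketch, assuming the uberjar is named drake.jar, sits next to the wrapper, and has drake.core as its main class (drip is a drop-in replacement for the java command):

    # in your `drake` wrapper script, launch via drip instead of java:
    drip -cp "$(dirname "$0")/drake.jar" drake.core "$@"

The first run still pays the full startup cost; drip pre-spawns the next JVM in the background, so subsequent runs start almost immediately.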
It's a good point, and I agree it might not be the top priority, but I also understand the frustration. I, too, find the 5s startup rather irritating, especially when I make errors in the workflow file or didn't specify targets correctly. So, we are in search of ideas on how to fix it.
To be honest with you, no, we didn't seriously consider it. Maybe we should have. I do not know if ClojureScript would be able to work with all the dependencies we have (for example, Hadoop client library to talk to HDFS). But it's a good point nevertheless. I'll mention it in https://github.com/Factual/drake/issues/1.
I didn't realise originally that Drake integrated with HDFS. That's a really awesome feature, and I can see why the JVM made sense in development because of existing HDFS libraries.
Thanks for the response! I ask because I have an idea for a CLI program, and I want to write it in Clojure, but I'm worried about the startup time of the JVM. As I understand it, this issue is mitigated in Drake by the fact that a typical job will crunch lots of data and therefore take lots of time. That's not the case for my program, it needs to be quick.
Yes, startup times are a pain. As of this morning, Drake works with Drip, which is a nifty tool to bring down startup times. It spins up "backup" JVMs, so the next time you run the command, a JVM is ready. It works great for interactive environments where at least several seconds pass between runs, but won't do much if you need to run Drake several times per second from an automated script.
Another option is Nailgun, but it has its limitations, too.
None of this is ideal. If you want to write a very simple CLI program, keep this in mind. You may want to stay away from the JVM.
I could imagine a bash shell that helps create drake files by remembering, in a richer history structure, all files read or modified by subprocesses.
(A degenerate drake file, one line per 'step', would almost be a 1:1 representation of this richer history... though you then might want to coalesce and reorder atomic steps to represent the real shape of your workflow and dependencies.)
djb's redo, a make alternative, feels like a good fit for these types of data manipulation and dependency representation. The build script is just shell, so you can do things like embed Python with a heredoc. One bit of syntactic sugar is that redo assumes stdout is the desired contents of the generated file, so you don't need to explicitly pipe to an OUTPUT variable. A sketch of the pattern is below.
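For a target counts.txt, redo runs a shell script called counts.txt.do and captures its stdout as the file's contents; dependencies are declared with redo-ifchange (hypothetical file names):

    # counts.txt.do - rebuilds counts.txt whenever words.txt changes
    redo-ifchange words.txt
    # embed Python via a heredoc; whatever it prints becomes counts.txt
    python - words.txt <<'EOF'
    import sys, collections
    counts = collections.Counter(w for line in open(sys.argv[1]) for w in line.split())
    for word, n in counts.most_common():
        print("%s\t%d" % (word, n))
    EOF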
I suspect most of the points I made would be applicable to redo as well, if not more so. Trivial things don't require Drake. Heck, they often don't require Make either - just put it in a linear shell script if the steps are not too expensive. It's when things get complicated that you need something like Drake.
Redo lacks features baked into Drake, especially the Hadoop integration, but I believe it would be easier to incorporate custom functionality into redo versus hacking Make or writing a custom build system. I haven't used Drake, so I would be interested in a small but complicated Drake script which tackles an intractable problem in Make. I don't claim redo can provide a cleaner solution than a purpose-built system, but I think it will be unexpectedly simple.
The most crucial things that Make lacks are multiple outputs and precise control over execution. When you're debugging/developing a large and expensive workflow, you absolutely must have the ability to say things like:
- run only this step, I'm debugging it
- I've changed implementation of this step, re-build it and everything that depends on it
- build everything except this branch, it's expensive and I don't need to rebuild it that often (example: model training)
Another example of an intractable problem in Make is timestamp-based dependency resolution between local and HDFS files. If Make can't look at HDFS, it can't say whether a step needs to be built or not. I don't think you can fix that with external commands.
But generally, search for intractable problems is a futile one. Remember, everything you can code in Java, you can code in a Turing machine. :)
So make's default behaviour with "make somefile.csv" is to build the whole tree of dependencies. To force a rebuild of everything, run "make -B somefile.csv". It then assumes everything is out of date.
To force rebuild of one step, just delete its output or run "touch" on one of its dependencies before running make. Then that step will get redone.
I like to have generated data in a separate folder, say "output/" which you can then snapshot, blow away, or do what you like with. Basically though, I keep it separate from data and code inputs.
Thanks! This much I know. But it doesn't answer my question. Let me repeat it: could you please give me a command to re-build a particular target and everything that depends on it?
Aboytsov wants to rebuild the target and everything that depends on the target, not rebuild the target and everything that the target depends on. He wants to walk the dependency tree in the opposite direction.
No, make -B mytarget rebuilds either mytarget only or mytarget and everything mytarget depends on. A more common scenario is when you need to rebuild mytarget and everything that depends on it. Without rebuilding other parts of the workflow that you don't need.
This is a really weird request. make won't rebuild things that haven't changed, so the default make all rule will only rebuild the things depending on mytarget. Every time you change mytarget, just run make (all) and everything that depends on mytarget (and only those things) will be rebuilt.
This is not a weird request, this is one of the most common things we do when we're developing a workflow. You need to do this every time you make changes to code and you want these changes to propagate.
You can't run "make all", because it literally builds everything. You might be working on a specific branch of the workflow, and the overall workflow could be huge. And out-of-date in a lot of places. Or it could contain steps that are very expensive, but not necessary to build for your development purposes (for example, generating a model). This is why exclusions are also important, and make also does not support them.
Make also does not support multiple outputs, and I gave you a link proving that earlier. And a lot of other things which we think are important, too (I could make a list. I did, actually).
If you like Make, you should continue using it. I think it is a little arrogant on your part to try to explain to us that we simply wasted our time. We built the tool to address the problems we were facing. If you do not face similar problems, by all means, use Make.
Sorry, I didn't mean to imply "you're doing it wrong". Didn't even realize you made the tool. Oops. Personally, if large chunks of my output are out of date, I don't like the idea of commingling them with new stuff, but obviously I don't know a whole lot about what you're doing.
Now imagine you're not the only one working on it. You may never even have run it in its entirety, since it takes 10 hours. Imagine there's a branch which you, a developer, are currently working on. This branch depends on some other files in the workflow. Let's say it generates synonyms from the sentence dataset, or does some complicated cleaning of some intermediate data. This is not a small task and you will spend a couple of days doing it, re-running your code dozens of times in the process.
You don't care about other parts of the workflow. You only care about what you're developing and how it propagates. Does it propagate? Does it break something down the road? What is the final output? Did all this synonym collection help? Did the changes you made in learning code improve the results?
When you're done, you may commit your code and somewhere else somebody will build a nice new dataset, but while you're working on it, you really need to be able to run any target individually, with dependencies or without, as well as forcibly rebuild all steps down the tree to see the final result.
This is basically the case where you don't (or it's infeasible to) capture the dependencies fully, so you want to rebuild everything from the target onwards after some change.
Haha, well pointed out. I'm clearly having trouble parsing today.
"rm target; make" can work, but only if you're using a pattern for data pipelines where there is only one default set of downstream targets. If the one Makefile supports a range of downstream targets, then this won't work.
I concede, make doesn't support that operation out of the box :)
Reminds me of Makeflow: A Portable Abstraction for Data Intensive Computing on Clusters, Clouds, and Grids. Workshop on Scalable Workflow Enactment Engines and Technologies (SWEET) at ACM SIGMOD, May 2012.
Nice. Surprisingly, we weren't aware of Makeflow and kinda missed it completely. At first glance, it seems like Drake is quite a bit more feature-rich than Makeflow. Please see the designdoc and/or the tutorial video for details.
Cool project. I expected to be underwhelmed, but when I saw the dependency stuff, I was impressed. Maybe it should include a hook so that it can detect dataset changes automatically by running a separate command (or did I miss it?).
With a bit of creativity, I think there may be a lot of applications here.
This is an awesome idea. Currently Drake only supports timestamped and forced evaluations, but it would be great to have an evaluation abstraction where you could provide your own implementation of whether a target's changed and/or whether a target is to be considered fresher/younger than another target. Timestamped would compare modification times, forced would return true, and it could be extended indefinitely.
If you're serious about it, please submit a feature request (https://github.com/Factual/drake/issues), and describe more specifically what you would like to be able to do in your case.
Artem, the approach you guys are using is really EXCELLENT!
I think a bit of the disconnect here may be because some commenters are used to 'compiling' code, versus the 'compiling' data angle that you are taking.
This is especially evident in the make-dependencies discussion with lars512.
To give a simple specific example: I have a dataset of, say, 5000-50000 SKUs that are aggregated across 9-12 dimensions. My final report/analysis uses 3 scenarios. Now one sub-set of one scenario has changed [that's the raw input] - of course, running 'data compilation' using the data that changed and ONLY what depends on it is the most effective & efficient approach.
Thank you very much for your kind words and support, and we certainly are looking forward to your feedback, feature requests and bug reports, as well as your code contributions, should you so desire.
We built this based on our own pain points with a larger audience in mind. We hope we got some things right, because the success of any tool is defined by its users. So, if you like it, let's build a thriving community together!
Whoa, this is the first time I'm hearing of "Factual", but playing around I'm impressed! There was a side project I had a while ago, which I eventually gave up on because I couldn't source some data. These guys found it!
I like the idea that the tasks can be implemented in any language, but I feel like this has limitations compared to something like Rake, where the step definition is code, too. What this means is that in Rake I am not just limited to defining new task bodies, but new ways of defining tasks themselves.
I see that Drake is implemented in Clojure, so I'd imagine you understand the value of homoiconicity and extensible languages. So I wonder why you didn't just use Clojure all the way through?
In short, we don't feel like it's an either-or question. We want to have Drake as a command-line frontend to the core functionality, but we would love to see/have other frontends developed as well. Currently, there's no Clojure DSL for Drake, but I think it'd be totally awesome.
The reason we started from the command line is that our workflows are heterogeneous, and we also didn't want to limit Drake to developers and associate it with coding. Clojure can be quite a big learning curve if you only need it to specify steps and link them together through file dependencies.
We had an important design goal in mind: Drake should be as simple as writing a shell script. If it's not, people won't reach for it early enough - our experience shows that most workflows start as trivial shell scripts with one or two steps, and by the time they grow into something unmanageable, it's kinda too late. :)
On a related note, Drake supports Clojure code inlining for manipulation of the parse tree. It's not an equivalent, just a somewhat related feature. It allows you to modify the steps, dependencies, and anything else in the parse tree directly from Clojure.
I'm glad the step definitions are not in Clojure or a unified programming language. It makes it much easier to pull in data specialists, product managers, and other non-engineers to help build and maintain a data workflow while leaving them the autonomy to run and troubleshoot the steps of the build specific to their skillsets.
There seem to be few differences between Drake and just rolling out Makefiles for data processing, but I definitely see this project has potential. Distributed processing over AWS/Compute Engine/etc. clusters would be one nice thing to have, as a kind of simpler alternative to Hadoop.
I really like the inline, multi-language scripting though.
Thanks! We feel that in practice, there's quite a lot of differences between Drake and most Make-like systems. See this response for details: http://news.ycombinator.com/item?id=5111527
Perhaps I am the only one having issues here, but I cannot seem to get drake to run. Is there anything that is supposed to be done after building the uberjar?
Further, I don't understand how I'm supposed to alter my path to be able to run drake by simply entering 'drake' - would it be possible to get some help?
The project's README file (https://github.com/Factual/drake - scroll down) contains building and running instructions, as well as how to create a simple script to run Drake which you can put on your PATH.
My mistake was that I didn't realize I was supposed to have Drake.jar in the same folder as the workflow that I was trying to execute (I'd keep getting the error 'Unable to access jarfile drake.jar'). Naive error, I suppose.
However, I'm still having trouble executing the 'A nicer way to run Drake' instructions. I created a file named 'drake' on my path, and inserted the given text. However, I keep getting the error
'Exception in thread "main" java.lang.NoClassDefFoundError: drake/core'
Was I supposed to alter the script in any way? I just naively copy/pasted.
You don't have to have Drake.jar in the same folder as the workflow you're trying to execute.
You create the script as described in the documentation, and you put it somewhere on your PATH along with the JAR file. The JAR file has to be in the same directory as the script.
Actually, it was in the doc. If you followed the instructions below precisely and it still doesn't work, just send us your terminal log so that we can see what you're missing.
A nicer way to run Drake
We recommend you "install" Drake in your environment so that you can run it by just typing "drake". Here's a convenience script you can put on your path:
Save that as `drake`, then do `chmod 755 drake`. Move the uberjar to be in the same directory. Now you can just type `drake` to run Drake from anywhere.
Am I the only one who immediately thought of Drake the rapper? He's pretty famous, not sure if this was considered during the naming process. Even if it's not a legal problem, it's an SEO/social media problem.
Although I don't agree that the name "Drake" is an issue, I do find it interesting that an even more apt name for an application of this type might be "Usher"!
True, but I wouldn't call my product 'Queen', 'Cream', 'Journey', or another noun that could be confused with someone or something famous. This distracts from the conversation of the product, so perhaps I shouldn't have brought it up.
Thank you. Why not? We would love to see it, but we're also not actively using Amazon S3 at the moment. But we would be more than happy to review code contributions.
Adding a new filesystem to Drake's source is very easy. You just create a filesystem object that implements a handful of methods - listing a directory, removing a file, renaming a file, and getting a file's timestamp - and then put it, along with the corresponding prefix, in the filesystem map. That's pretty much it. Assuming there's a client JAR for Amazon S3, written either in Clojure or in Java, it should be quite simple to do.
We love Clojure. Lisp is an extremely powerful language, and Clojure brings all of that to the practical JVM world. And Lisp is quite good at operating on lists and graphs, which is a big part of Drake.
Out of curiosity, why did you go the Clojure route instead of the Scala route? From what I understand, Scala has more libraries available, including AI and NLP libraries, but maybe my impression is not correct?
It's hard to compare Clojure and Scala. Scala is a multi-paradigm programming language with strong OOP support and functional support. It's arguably more verbose than Clojure but looks much more similar to Java.
Clojure is a Lisp. Lisp stands apart from all other programming languages, first of all because it supports syntactic abstraction (a.k.a. "code is data"). Hardcore addicts (I'm not one of them) say there are only two programming languages - Lisp and non-Lisp.
When we made the decision to switch to Clojure, several things affected it, in no particular order:
- we had some people who were already very proficient in Lisp
- we liked how expressive and compact it was
- Lisp is considered to possess immense expressive power (see http://www.paulgraham.com/lisp.html)
- we were enamoured with Cascalog (http://nathanmarz.com/blog/introducing-cascalog-a-clojure-ba...), which is written in and for Clojure. This one paid off very well.
- Lisp has a reputation of being great at manipulating data: lists, graphs, etc.
As for libraries, both Clojure and Scala are JVM-based, and Clojure has very good syntax for Java interop, so all Java libraries are available to us. But, of course, the Clojure community also spits out libraries like crazy - for example, take a look at this marvel, which we use in Drake for parsing: https://github.com/joshua-choi/fnparse.
Thanks for your feedback. I've been playing around with both languages, and was leaning towards Scala since it seemed more likely I could use it professionally, even though I liked Clojure a bit more - I sorta like the Lisp-like syntax.