You've just invented the field of program synthesis, which is a super active research area right now in programming languages! Here's the paper list from a class at UW focused on this area: http://courses.cs.washington.edu/courses/cse599a2/15wi/ . The Sumit Gulwani paper from the first week is a really good introduction to the field.
Speaking from academic computer science, I have never heard of anyone deciding not to work on something for this reason--perhaps because if an idea could immediately be turned into a successful company, it's probably not good research! And most (all?) CS PhD programs in the USA (which award an MS in passing) guarantee full funding for a number of years, and most successful students are funded the whole time.
You absolutely shouldn't feel "mentally inferior" to folks with PhDs. Doing academic research, which probably involves getting a PhD, is almost certainly the best way to make a significant contribution to Computer Science. These two facts aren't really in conflict, I don't think.
We are reading! This is a great question. We currently don't prove anything about liveness. We'd love to work on this.
As you probably know, Raft and other consensus algorithms are not guaranteed to be live in all situations. But subject to some assumptions about the frequency of failure, they are guaranteed to make progress.
In Verdi, systems are verified with respect to semantics for the network they are running on. Our semantics currently don't include any restrictions about how often failures can happen; a failure "step" can occur at any time. We're not sure what the best way is to introduce this kind of restriction, but we've got a couple ideas. One would be to guarantee that the total number of failures has some finite bound which is unknown to the system itself but which is available during verification. Another would be to model failure probabilistically. We will probably end up doing at least one of these things in the next year or so :).
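The first idea above can be sketched concretely. This is a hypothetical Python model (not Verdi's actual Coq semantics): a failure "step" may fire at any point in an execution, but the total number of failures is drawn from a finite bound that the verified system never observes--only the proof does.

```python
# Hypothetical sketch of a network semantics with a finite failure budget.
# The bound is a parameter of the semantics (visible during verification),
# not something the running system can query.

class Network:
    def __init__(self, nodes, failure_bound):
        self.nodes = set(nodes)
        # Known to the proof, unknown to the system itself.
        self.failures_left = failure_bound

    def step_deliver(self, dst, msg, handler):
        """Normal step: deliver a message to a live node's handler."""
        if dst in self.nodes:
            handler(msg)

    def step_fail(self, node):
        """Failure step: may occur at any time, but only finitely often."""
        if self.failures_left > 0 and node in self.nodes:
            self.nodes.discard(node)
            self.failures_left -= 1
            return True
        # Budget exhausted (or node already down): no failure occurs,
        # which is what makes liveness arguments possible.
        return False
```

Under this kind of semantics, a liveness proof could case-split on whether the budget is exhausted yet, since eventually it must be.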
Thanks. That sounds like extremely interesting research. Being able to say things about liveness relative to probability of failure (or the distribution of probability of failure) would be very interesting.
Hey, I'm Doug Woos--thanks for this excellent summary! It's worth noting that the Raft proof was completed by a team of people, including me, my research partner James Wilcox (http://homes.cs.washington.edu/~jrw12/), and the other folks listed on our web page at http://verdi.uwplse.org/ .
Apologies to your research partner for leaving him out; it's edited! Congratulations on the proof, by the way--I'm looking forward to the weekend so I can have some time to appreciate it in more depth! :-)
Thanks! Let us know if you have any questions about the proof itself--unfortunately, it's not particularly well-documented, and Coq isn't super easy to read in the first place.
At this point, the standard intro text is Software Foundations [1]. I highly recommend it; it will teach you Coq and also probably make you a better programmer. After SF, Certified Programming with Dependent Types [2] gets more into the practice of proving serious programs correct. These books are both available online in the form of literate Coq files.
As far as online lectures, OPLSS [3] often has Coq lectures which are quite good.
One could imagine doing some kind of differential testing of another key-value store against vard, checking that the two return the same values over a large stream of random requests. I'm not sure that doing that would gain you anything over using Jepsen [1], though. One could also imagine implementing another key-value store in Verdi and proving it equivalent to vard; reasons to do this might include using better data structures or serialization mechanisms, or a different consensus protocol. I'm not sure that this would require significantly less effort than proving this new key-value store (usrd? whittend? :)) correct from scratch, but I'd love to see someone do this!
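The differential-testing idea is straightforward to sketch. Here `Store` is a hypothetical in-memory stand-in; in practice one side would be a vard client and the other the system under comparison.

```python
import random

# Hedged sketch of differential testing: fire the same random stream of
# requests at two key-value stores and check that their answers agree.

class Store:
    """Trivial in-memory key-value store standing in for a real client."""
    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value
        return value

    def get(self, key):
        return self.data.get(key)

def differential_test(store_a, store_b, num_requests=10000, seed=0):
    rng = random.Random(seed)  # fixed seed so failures are reproducible
    for _ in range(num_requests):
        key = rng.choice("abcdefgh")
        if rng.random() < 0.5:
            value = rng.randint(0, 99)
            ra, rb = store_a.put(key, value), store_b.put(key, value)
        else:
            ra, rb = store_a.get(key), store_b.get(key)
        assert ra == rb, f"divergence on key {key!r}: {ra} != {rb}"
    return True
```

Note this only exercises sequential, single-client behavior; the hard bugs in replicated stores involve concurrency and faults, which is exactly where a Jepsen-style harness (or a proof) earns its keep.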
Your objections to MongoDB's model seem reasonable, but I don't see any evidence in either this comment or the linked blog post that DocumentDB is better (especially in the absence of benchmarks). What is this "battle-tested distributed systems framework"? Several of your complaints about MongoDB have to do with the interaction between persistence to disk and replication to the network; as the Multi-Paxos algorithm does not specify when data should be written to disk (much less what the format should be), what reason is there to believe that DocumentDB does this any better?
I'm totally willing to believe that DocumentDB beats the pants off MongoDB on just about every axis (in fact, that seems pretty likely), but it's going to take some actual numbers and a better description of the internals.
I agree with you - we need numbers before making that kind of conclusion and I haven't run any benchmarks on the public version of DocDB. I'd like to see someone measure MongoDB on Azure vs DocDB on Azure - even then it might not be a fair measurement of db vs. db, since we don't know what machines DocDB is hosted on.
All I can really say is that the replication model provides a significant performance boost over MongoDB in the multiple replica (i.e., production) scenario.
We were using MongoDB at Microsoft for a while (I left MS almost a year ago). I was developing a real-time metrics system with it. It was very unstable at our target load (500k increments per minute, high percentage of tomorrow's documents preallocated the day before). We only managed maybe 10% of that with MongoDB, IIRC. Sometimes it would choke and not come back until I restarted the cluster (~30 machines total, I believe. 3 replicas * 10 shards).
We were so sure that MongoDB should be able to handle this scenario, since they talk about it in their documentation. After talking with the MongoDB devs, we came to the conclusion that even though we were issuing increment operations on preallocated documents, MongoDB was:
a) using a global lock on the "local" db used for replication, and
b) "replicating via disk" instead of via the network. In other words, replication requires writing to the journal before other members of the replica set have a chance to apply the change and ack back. This results in a loss of concurrency.
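To make the concurrency loss in (b) concrete, here is a back-of-the-envelope model (the latency numbers are made up for illustration): if replicas can only pull an op after it hits the journal, the disk write and the network round-trip are serialized; replicating over the network lets them overlap.

```python
# Hypothetical latency model for "replicate via disk" vs. "replicate
# via network". Numbers are illustrative assumptions, not measurements.

JOURNAL_MS = 10  # assumed journal (disk) write latency
NETWORK_MS = 2   # assumed replica round-trip latency

def ack_latency_via_disk():
    # Replicas see the op only after the journal write completes,
    # so the two latencies add up.
    return JOURNAL_MS + NETWORK_MS

def ack_latency_via_network():
    # Journal write and replication proceed concurrently; the ack
    # waits only on whichever finishes last.
    return max(JOURNAL_MS, NETWORK_MS)
```

Under these assumptions the network path acks in 10ms versus 12ms, and the gap compounds under load because the journal becomes a serialization point for every replicated write.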
The lack of async query support in the C# driver didn't help either.
Eventually we used a replicated, write-back cache which sits atop the framework DocDB uses. Not a fair comparison, but the goal was achieved easily with 1/3rd the hardware. We just backed it onto Azure Table Storage. Our queries were all range queries, which table storage supports.
Did you read the whole article? If not, go ahead and read the part about pressuring the intern into giving his assistant oral sex while he (Kennedy) watched. That is absolutely not "normal behavior".
I'm honestly a bit scared if you view this as normal. I would really consider seeing a professional, if I were you, just to get those assumptions of yours checked.
They might agree with you, in which case I'm wrong.