In SQLite, transactions by default start in “deferred” mode. This means they do not take a write lock until they attempt to perform a write.

You get SQLITE_BUSY when transaction #1 starts in read mode, transaction #2 starts in write mode, and then transaction #1 attempts to upgrade from read to write mode while transaction #2 still holds the write lock.

The fix is to set a busy_timeout and to begin any transaction that does a write (any write, even if it is not the first operation in the transaction) in “immediate” mode rather than “deferred” mode.

https://zeroclarkthirty.com/2024-10-19-sqlite-database-is-lo...
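
Here's a minimal sketch of that fix using Python's sqlite3 module (the accounts table and the 5-second timeout are just placeholders for illustration):

    import sqlite3

    # isolation_level=None puts the connection in autocommit mode so we
    # control transactions explicitly with BEGIN IMMEDIATE / COMMIT.
    conn = sqlite3.connect("app.db", isolation_level=None)

    # Wait up to 5 seconds for a competing writer instead of failing
    # immediately with SQLITE_BUSY ("database is locked").
    conn.execute("PRAGMA busy_timeout = 5000")

    def transfer(conn, amount):
        # BEGIN IMMEDIATE takes the write lock up front, so the transaction
        # never has to upgrade from read to write mid-way (that upgrade is
        # what triggers the un-retryable SQLITE_BUSY described above).
        conn.execute("BEGIN IMMEDIATE")
        try:
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = 1", (amount,))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = 2", (amount,))
            conn.execute("COMMIT")
        except Exception:
            conn.execute("ROLLBACK")
            raise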


> sqlite in its default (journal_mode = DELETE) is not durable.

Not true. In its default configuration, SQLite is durable.

If you switch to WAL mode, the default behavior is that transactions are durable across application crashes (or SIGKILL or similar) but are not necessarily durable across OS crashes or power failures. Transactions are atomic across OS crashes and power failures. But if you commit a transaction in WAL mode and take a power loss shortly thereafter, the transaction might be rolled back after power is restored.

This behavior is what most applications want. You'll never get a corrupt database, even on a power loss or similar. You might lose a transaction that happened within the past second or so. So if your cat trips over the power cord a few milliseconds after you set a bookmark in Chrome, that bookmark might not be there after you reboot. Most people don't care. Most people would rather have the extra day-to-day performance and reduced SSD wear. But if you have some application where preserving the last moment of work is vital, then SQLite provides that option, at run-time or at compile-time.

When WAL mode was originally introduced, it was guaranteed durable by default, just like DELETE mode. But people complained that they would rather have increased performance and didn't really care if a recent transaction rolled back after a power-loss, just as long as the database didn't go corrupt. So we changed the default. I'm sorry if that choice offends you. You can easily restore the original behavior at compile-time if you prefer.
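
For reference, a small run-time sketch of choosing between those two behaviors, using Python's sqlite3 (the file name is a placeholder; as I read the SQLite docs, synchronous=FULL in WAL mode gives you power-loss durability, while NORMAL gives the lighter trade-off described above):

    import sqlite3

    conn = sqlite3.connect("app.db")
    conn.execute("PRAGMA journal_mode = WAL")

    # NORMAL: durable across application crashes; a transaction committed
    # just before a power loss may be rolled back on recovery.
    conn.execute("PRAGMA synchronous = NORMAL")

    # FULL: sync on every commit, so transactions also survive power loss,
    # at the cost of extra latency and SSD wear.
    # conn.execute("PRAGMA synchronous = FULL")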


This comment got me digging, n8n actually has a pretty good post-open source license[1] - I'm glad to see more successful examples of this sort of licensing in the wild

[1]: https://docs.n8n.io/sustainable-use-license/


The backstory is complicated. The plan was to establish a consortium between CMU, Tsinghua, Meta, CWI, VoltronData, Nvidia, and SpiralDB to unify behind a single file format. But that fell through after CMU's lawyers freaked out over Meta's NDA stuff to get access to a preview of Velox Nimble. IANAL, but Meta's NDA seemed reasonable to me. So the plan fell through after about a year, and then everyone released their own format:

→ Meta's Nimble: https://github.com/facebookincubator/nimble

→ CWI's FastLanes: https://github.com/cwida/FastLanes

→ SpiralDB's Vortex: https://vortex.dev

→ CMU + Tsinghua F3: https://github.com/future-file-format/f3

On the research side, we (CMU + Tsinghua) weren't interested in developing new encoders and instead wanted to focus on the WASM embedding part. The original idea came as a suggestion from Hannes@DuckDB to Wes McKinney (a co-author with us). We just used Vortex's implementations since they were in Rust and with some tweaks we could get most of them to compile to WASM. Vortex is orthogonal to the F3 project and has the engineering energy necessary to support it. F3 is an academic prototype right now.

I note that the Germans also released their own file format this year that also uses WASM. But they WASM-ify the entire file and not individual column groups:

→ Germans: https://github.com/AnyBlox


I tend to think that much of the difference between great and mediocre engineers comes down to mindset. The great engineers I've encountered have a commitment to making everything they touch better. They are adaptable and persevere. They believe they will either succeed at the task at hand or conclusively determine that it isn't possible given the current constraints. They recognize that failure will happen and do not get discouraged by it, instead seeing it as an opportunity for growth. When they encounter an issue caused by systemic problems, they will try to fix it systemically. They will not ship the first draft and will take a little bit of extra time to get things right, which slows the initial release slightly but more than pays for itself with a dramatically lower maintenance burden.

This type of engineer is often misunderstood and underappreciated by management. Management is often motivated by immediate short-term goals. Instead of cherishing the work of the engineers who build the foundational systems that will enable the long-term success of the org, they complain about them missing arbitrary short-term goals and accuse them of doing engineering for engineering's sake instead of real work.

Management will celebrate the coder who whips up a buggy, but cool, feature in a week and will look the other way at the fact that the feature will always be a little bit broken because of its shoddy construction and instead will allocate some lesser engineers to maintain it. If instead the feature had been built correctly from the start, it may have been launched a bit later, but the overall cost will be much lower. Moreover, the poor engineers who are forced to maintain it (and let's be honest, the people who quickly churn out shoddy but shiny work almost never have to maintain it themselves) will not only be wasting their time, they will be actively internalizing the anti-patterns present in the code. This both inhibits their growth and understanding of good design principles and teaches them the bad lesson that crap is good and unless they have a uniquely strong character or good mentors, they will tend to either become demoralized (hurting their productivity and value to the company) or they will internalize the incentive to get out of maintenance work and build shoddy features of their own to pass down to the next poor soul.

The truly great engineer is the one who breaks this cycle by building extendable systems that are correct by design and takes accountability for everything they ship. They raise up the entire org both by their work and example. In the long run, they will be far more productive and positively impactful than the sloppy cowboy coder.

Unfortunately, the industry writ large celebrates and incentivizes cowboy coding so doing the right thing is very much against the grain. Indeed, the people who rise up the org chart tend to be the cowboys so they cannot even see the value of the other way and will often actively antagonize or undermine those who do the right thing (and threaten their dominant position in the org).


Generally good points. Unfortunately, existing file formats rarely follow these rules. In fact these rules should form naturally when you are dealing with many different file formats anyway. Specific points follow:

- Agreed that human-readable formats have to be dead simple, otherwise binary formats should be used. Note that textual numbers are surprisingly complex to handle, so any formats with significant number uses should just use binary.

- Chunking is generally good for structuring and incremental parsing, but do not expect it to give you reorderability or backward/forward compatibility for free; unless those properties are explicitly designed in, they do not exist (see the sketch after this list). Consider PNG for example; PNG chunks were designed to be quite robust, but nowadays some exceptions [1] do exist. Versioning is much more crucial for that.

[1] https://www.w3.org/TR/png/#animation-information

- Making a new file format from scratch is always difficult. Already mentioned, but you should really consider using existing file formats as a container first. Some formats are even explicitly designed for this purpose, like sBOX [2] or RFC 9277 CBOR-labeled data tags [3].

[2] https://nothings.org/computer/sbox/sbox.html

[3] https://www.rfc-editor.org/rfc/rfc9277.html
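
As a toy illustration of the chunking point above, here's a reader for a hypothetical PNG-style container (4-byte big-endian length, 4-byte ASCII tag, payload); the tags and layout are made up for the example:

    import io
    import struct

    def read_chunks(stream):
        # Each chunk: 4-byte big-endian payload length, 4-byte ASCII tag,
        # then the payload (no per-chunk checksum, unlike real PNG).
        while True:
            header = stream.read(8)
            if len(header) < 8:
                return  # clean end of stream
            length = struct.unpack(">I", header[:4])[0]
            tag = header[4:8].decode("ascii")
            payload = stream.read(length)
            if len(payload) < length:
                raise ValueError("truncated chunk: " + tag)
            yield tag, payload

    # A reader that only understands 'VERS' and 'DATA' can still skip
    # unknown chunks, but nothing here gives you reordering or forward
    # compatibility for free; that has to be designed in explicitly.
    buf = io.BytesIO()
    for tag, payload in [(b"VERS", b"\x01"), (b"DATA", b"hello")]:
        buf.write(struct.pack(">I", len(payload)) + tag + payload)
    buf.seek(0)
    print(list(read_chunks(buf)))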


> you get the added benefit of writing queries in JSON instead of raw SQL.

I’m sorry, I can’t. The tail is wagging the dog.

dang, can you delete my account and scrub my history? I’m serious.


The bill rate for serious end-to-end web application development in a modern language, where the customer provides a loose spec and the outcome is "working, usable application" is somewhere in between 100-200/hr on the very conservative (low) end.

So, take your outside estimate of how much time the project is going to take (you think 40 hours; maybe make it 48), and pick a bill rate in the spectrum of 100-200/hr. Work that into a daily rate.

Provide proposal under the following terms:

* You offer to build the project on a time & materials basis, 6 days at a daily rate of ((100..200)x8).

* Prior to starting the engagement, you will complete a proposal with detailed acceptance criteria (app MUST do this, app MUST NOT do that, app MAY do this, &c). You'll interview your client (gratis, because you're a professional) to work out that acceptance criteria in advance.

* Should any of the requirements or acceptance criteria change once contracts have been signed, you'll accommodate by adding additional billable days to the project (subject to whatever scheduling constraints you may have at the time).

* Should they want to lock in additional billable days in anticipation of changing requirements out from under you, they can buy those billable days on a retainer (use-it-or-lose-it) basis at a small (~10%) discount to your daily rate. This (or some clause like it) allows them to pay you extra (ie, for days they buy but you don't work) to guarantee that if they change their mind about some goofy feature, you'll be available immediately to accommodate, and they don't have to wait 6 months.

* Your contract will specify a class of critical bugs (security, things that potentially lose previously-stored data) and a class of major bugs (things that make the system unusable). For a period of N months (maybe 12) after delivery, you'll commit to fixing critical bugs within N days, for free if they take less than N hours to fix, and at your daily rate otherwise; repeat (but with more favorable terms for you) for major bugs.

* For an annual fee of 20% of the price of the contract, you'll provide maintenance, which (a) continues the critical/major bug fix window past the N months specified above and (b) provides an annual bucket of K billable days with which you will fix non-major bugs; this is provided on a retainer (use-it-or-lose-it) basis.

The idea is simple: you want to:

(a) Give the client something that looks as close as possible to a fixed-price project cost.

(b) Not actually commit to a fixed-price project cost more than you have to.

(c) Turn the downside of long-term bugfix support into an upside of recurring revenue.

This is just a sketch, you'd want to tune these terms. You'll also want to pay a lawyer ~$100-$200 to sanity check the contract. (Your contract will look like a boilerplate consulting contract, ending in a "Statement of Work" that is a series of "Exhibits" [contract-ese for appendix] that spells out most of the details I listed above). Pay special attention to the acceptance criteria.

Remember also that you're liable for self-employment tax, which is due quarterly, not on April 15. You might also consider registering a Delaware LLC (~$100, online) and getting a tax ID, because liability gets sticky with software you deliver to make someone else's business work. You probably do not need to consult a lawyer about LLC formation; most of the trickiness of company formation is with partnership terms and equity, which isn't your problem.


Here's my neo-Luddite take on this. Slides with support for notes in a synchronized second window in just 371 bytes of minified javascript, some HTML and some CSS:

    let a=[...document.getElementsByClassName("slide")]
      .map((a,b)=>[a,"slidenote"==(b=a.nextElementSibling)
      ?.className?b:a]),b=0,c=0,d=()=>a[b][c]
      .scrollIntoView(),e=new BroadcastChannel("s"),
      l=a.length-1;d();e.onmessage=({data:a})=>{c^=a.c,
      b=a.b,d()};document.addEventListener("keypress",
      ({key:f})=>{b+=(f=="j")-(f=="k");b=b<0?0:b>l?l:b;
      c^=f=="n";e.postMessage({c,b});d()});

    div.slide, div.slidenote {
        height: 100vh;
        width: 100vw;
        /* Other slide styling options below */
        ...
        ...
    }

    <div class="slide">
      Anything in here is one slide
    </div>
    <div class="slidenote">
      (optional) Anything in here is
      a note for the slide above
    </div>
You can trivially use the HTML and CSS inside markdown, so any markdown parser that generates HTML is now an ultra-lightweight slides generator.

For a deeper explanation, see Dave Gaur's original minslides[0] and my own presentation on how I added note-support to it and golfed the JS code[1].

[0] https://ratfactor.com/minslides/

[1] https://nbd.neocities.org/slidepresentation/Slide%20presenta...


This article touches on a lot of different topics, which makes it a bit hard for me to come away with a single coherent takeaway, but here are the things I'd point out:

1. The article ends with a quote from the Backblaze CTO, "And thus that the moral of the story was 'design for failure and buy the cheapest components you can'". That absolutely makes sense for large enterprises (especially enterprises whose entire business is around providing data storage) that have employees and systems that constantly monitor the health of their storage.

2. I think that absolutely does not make sense for individuals or small companies, who want to write their data somewhere and ensure that it will be there in many years when they might want it, without constant monitoring. Personally, I have a lot of video that I want to archive (multiple terabytes). The approach whose risk I'm most comfortable with is (a) for backup, I just store on relatively cheap external 20TB Western Digital hard drives, and (b) for archival storage I write to M-DISC Blu-rays, which claim to have lifetimes of 1000 years.



I love Postgres and relational databases in general, but the lack of documentation generation found in general purpose programming languages for the last 20 years is a glaring omission.

There simply is no good equivalent to Javadoc for SQL databases. (Except you, Microsoft. I see you and MS SQL Server tools. I just don't work on Windows.) You have ERD tools that happily make a 500-table diagram that's utterly useless to humans, ugly as sin, not at all interactive, etc. Seriously, Javadoc (and Doxygen, etc.) have been on point since the 1990s.

OOP and functional design patterns are common knowledge while 50-year-old relational concepts are still woefully unknown to far too many data architects.

Folks don't write docs. They just don't. They start. They agree it's important. But the docs either get skipped or they stagnate. The only real option is automatic generation: reading the column and table comments, the foreign key relationships, the indexes, the unique and primary keys, the constraints, etc., and rendering it all into a cohesive and consistent user interface. Not the data. The structure.
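
As a sketch of what I mean (not a real tool), here's the kind of thing that could pull structure out of the Postgres catalogs and render it, using psycopg2 with a placeholder connection string; the schema name and plain-text output are just assumptions for the example:

    import psycopg2

    # Placeholder DSN; point it at the database you want to document.
    conn = psycopg2.connect("dbname=mydb")
    cur = conn.cursor()

    # Tables in the public schema plus their COMMENT ON TABLE text.
    cur.execute("""
        SELECT c.relname, pg_catalog.obj_description(c.oid, 'pg_class')
        FROM pg_catalog.pg_class c
        JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
        WHERE c.relkind = 'r' AND n.nspname = 'public'
        ORDER BY c.relname
    """)
    for table, table_comment in cur.fetchall():
        print("== %s ==\n%s\n" % (table, table_comment or "(no comment)"))

        # Columns, their types, and COMMENT ON COLUMN text.
        cur.execute("""
            SELECT a.attname,
                   pg_catalog.format_type(a.atttypid, a.atttypmod),
                   pg_catalog.col_description(a.attrelid, a.attnum)
            FROM pg_catalog.pg_attribute a
            WHERE a.attrelid = %s::regclass
              AND a.attnum > 0 AND NOT a.attisdropped
            ORDER BY a.attnum
        """, (table,))
        for col, coltype, col_comment in cur.fetchall():
            print("  - %s (%s): %s" % (col, coltype, col_comment or ""))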

It's the single biggest tooling failure in that sector in my opinion.


I think honestly the best solution really is to just use a stock PC and forget all of this crap. It's a shame there aren't any good open source setups using stock computers like the Raspberry Pi that can act as a good Chromecast replacement (or if there are, I missed them; I tried Kodi, and while it is pretty cool it isn't really great for streaming services like YouTube in my opinion), but on the other hand, it's not the end of the world.

Many modern TVs, if you can find one that isn't complete dogshit (good fucking luck), can do Miracast without connecting to the Internet or requiring an account. That's nice since it fills one role of Chromecast: the ability to easily cast your desktop.

But I'd like a full open source ecosystem implementing casting. Right now using Chromecast protocols from Firefox is a crapshoot and I just haven't bothered, but I don't think there's any reason why we can't just make our own. YouTube may be somewhat hostile, but at a certain point it's hard to stop a cast tool that just execs an official Google Chrome binary, you know? So there's always something that could be done.

That said, I keep a list of instructions for un-shittifying the Google Chromecast TV devices for myself, since I do have a few of them. Note that you already need to log in for this to really work, but I already do that, since I want to be logged into YouTube, for the time being (for Premium and age-restricted videos and subscriptions and etc.)

I'll just copy and paste them here:

    ## Replacing the Terrible Launcher
    Google took a dump all over the TV launcher with ads. Here is a workaround:
    1. Enable *Developer Mode* by tapping the TV OS Build Number in Settings -> About 7 times.
    2. Enable USB debugging.
    3. Prepare a device with `adb`. On NixOS, `nix shell nixpkgs#android-tools`.
    4. Find the IP in About -> Status and use it to do `adb connect [IP]`.
    5. Install an alternative launcher like ATV Launcher Pro.
    6. Disable the default launcher entirely. `adb shell "pm disable-user --user 0 com.google.android.apps.tv.launcherx && pm disable-user --user 0 com.google.android.tungsten.setupwraith"`
    ### Button Mapper
    Google also made their version of Android extra hostile to the launcher being replaced, so when you disable the launcher the Home and YouTube buttons will stop working. This can be fixed using a third party app called _Button Mapper_ available on Play Store.
    1. Install _Button Mapper_ from Play Store.
    2. Enable the Button Mapper Accessibility Service in Settings.
    3. Add the Buttons
    4. Map YouTube to open the YouTube app
    5. Map Netflix to open the Jellyfin app
    The app will warn about not working if the device sleeps, but this doesn't apply as these devices don't seem to "sleep" the way that Android phones and tablets do.
If your auth becomes stale you need to re-enable those app IDs and log back in. This will manifest as things simply not working, e.g. videos not playing. However, it only happened to me a couple of times. I think it requires session tokens to completely expire, which takes a while of inactivity.

Remember: if VCs believed in what they were doing they would not take a 2% annual management fee and 20% of the upside.

They’d take 40% of the upside and live on ramen noodles.

VCs make money by raising money from LPs.

They spend this money on investments which don’t look too bad if they fail, because nearly all of them fail. Looking good while losing all of your investors money on companies which go broke is the key VC skill.

Once in a while you get a huge hit. That's a lottery win; there is no formula for finding that hit. Broad bets help, but that's about it. The “VC thesis” is a fundraising tool, a pitch instrument; it makes no measurable difference to success. It's a shtick.

Sympathy, however, for the VC: car dealership sized transactions paired with the diligence burdens of real finance. It’s a terrible job.

Once you understand that VC is one of the worst jobs in finance and they don’t believe most of their own story — it’s fundraising flimflam for their LPs - it’s a lot easier to negotiate.

1) we are a sound bet not to get you in trouble if we fail (good schools and track records)

2) we will work hard on things which your LPs and their lawyers understand, leaving evidence of a good effort on failure

3) we know how the game works and will play by the unwritten rules: keep up appearances

The kind of lunatics who actually stand to make money with a higher probability than average - the “Think Different” category - usually violate all of these rules.

1) they have no track record

2) they work on esoteric nonsense

3) they look weird in public

And they’re structurally uninvestable.

Once you get this it’s all a lot easier: the job of a VC is not to invest in winners, that’s a bonus.

The job of a VC is to look respectable while losing other people’s money at the roulette wheel, and taking a margin for doing so.

I hope that helps.


I don't know the history of this bug but just want to chime in with a word about how absolutely terrifying the "associate email address with account" feature in account-based web apps is. Which, I guess, is my word: terrifying. It's one of the things pentesters make a beeline to mess with, with a vulnerability history stretching all the way back to the early 2000s, when these features were often implemented on standard Unix MTAs that could be tricked into sending password resets to multiple addresses at once; featureful web frameworks seem to have resurrected that attack in Gitlab.

If you're a normal HN reader that found themselves interested in this story, go check your password reset feature, specifically the email association logic!

Gitlab has, as I understand it, a pretty excellent security team, which gives some sense of how hard this bug class is to avoid.


(2nd user & developer of spark here). It depends on what you ask.

MapReduce the framework is proprietary to Google, and some pipelines are still running inside google.

MapReduce as a concept is very much in use. Hadoop was inspired by MapReduce. Spark was originally built around the primitives of MapReduce, and you can still see that in the description of its operations (exchange, collect). However, Spark and all the other modern frameworks realized that:

- users did not care about mapping and reducing, they wanted higher level primitives (filtering, joins, ...)

- mapreduce was great for one-shot batch processing of data, but struggled to accommodate other very common use cases at scale (low latency, graph processing, streaming, distributed machine learning, ...). You can do it on top of mapreduce, but if you really start tuning for the specific case, you end up with something rather different. For example, Kafka (scalable streaming engine) is inspired by the general principles of MR but the use cases and APIs are now quite different.


Google itself moved on to "Flume" and later created "Dataflow", the precursor to Apache Beam. While Dataflow/Beam aren't data processing execution engines themselves, they abstract the language for expressing data computations away from the engines. At Google, for example, a data processing job might be expressed using Beam and run on top of Flume.

Outside of Google, most organizations with large distributed data processing problems moved on to Hadoop2 (YARN/MapReduce2) and later in present day to Apache Spark. When organizations say they are using "Databricks" they are using Apache Spark provided as a service, from a company started by the creators of Apache Spark, which happens to be Databricks.

Apache Beam is also used outside of Google on top of other data processing "engines" or runners for these jobs, such as Google's Cloud Dataflow service, Apache Flink, Apache Spark, etc.


I work in the defense industry, it’s very much like the aerospace industry in that we deal with human life as a consequence of our work. We have software QA departments that operate very much like manufacturing or aerospace QA.

Software QA provides nothing of value to software development; having it as a dedicated function works against the overtly stated goals of the function and counterintuitively acts to degrade quality within software by mandating strict top down process and brittle end-to-end testing.

Although Software QA is intended to be an independent verification body that provides engineering organizations with tools and resources, in practice they function as a moral crumple zone [1] within the complex socio-technical defense industrial system, being one of the groups that the finger will be pointed at when something goes wrong and absorbing shock to the business in the event of a failure. As a result they have a strong incentive to highly systematize their work with specific process steps that can be applied generically to all projects, to shield themselves from liability.

Good software teams build quality into projects by introducing continuous integration, unit testing, creating feedback, and tightening these feedback loops. This acts to find problems quickly and resolve them quickly. Software QA's need for high-level, top-down, generic systemization requires them to work against these principles in practice. Bespoke, project-specific checks, such as unit tests, are not viewed as contributing to the final product and are discouraged by leadership who see them as waste.

To give an example of how these dynamics destroy quality in software. I once found a bug in software on a piece of test equipment where a logarithmic search function was not operating on a strictly sorted list. When I pointed this out to my leadership I was told that if we changed any part of code, it would require a new FQT, which would be too expensive to conduct and was not in the budget. Although the bug would have been trivial to solve, and was clearly wrong and would not provide any benefits by remaining in the test equipment software, the process required for changes prevented solving the issue.

[1] https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2757236


Downloading a 50MB WASM blob to run an emulator to boot a Linux kernel and then start an HTTP server to handle a single request is utter madness, which is probably why this will become popular to do.

The referenced paper on uniserve: https://petereliaskraft.net/res/uniserve.pdf is interesting, but seems to focus on systems where storage and compute are colocated, but it doesn't discuss (or maybe I skimmed too quickly) more modern architectures where compute and storage are separated (usually with a caching layer built into the compute nodes). In those architectures, most concerns about shifting data around at query time are moot.

Also in my experience building the scatter-gather query functionality and re-aggregation is usually the easiest part. The hard part is figuring out how to build fair multi-tenancy and QoS into what is essentially a massively parallel user facing real-time data lake.


This looks great!

Please consider removing any implicit network calls like the initial "Checking GitHub for updates...". This alone will prevent people from adopting it or even trying it any further. This is similar to gnu parallel's --citation, which, albeit a small thing, will scare many people off.

Consider adding pivot and unpivot operations. Mlr gets it quite right with syntax, but is unusable since it doesn't work in streaming mode and tries to load everything into memory, despite claiming otherwise.

Consider adding a basic summing command. Sum is the most common data operation, which could warrant its own special optimized command, instead of offloading this to an external math processor like Lua or Python. Even better if it had group-by (-by) and window-by (-over) capability. E.g. 'qsv sum col1,col2 -by col3,col4'. Brimdata's zq utility is the only one I know that does this quite right, but it is quite clunky to use.

Consider adding a laminate command. Essentially adding a new column with a constant. This probably could be achieved by a join with a file with a single row, but why not make this common operation easier to use.

Consider the option to concatenate csv files with mismatched headers. cat rows or cat columns complains about the mismatch. One of the most common problems with handling csvs is schema evolution. I and many others would appreciate if we could merge similar csvs together easily.

Conversions to and from other standard formats would be appreciated (parquet, ion, fixed-width formats, avro, etc.). Other compression formats as well, especially zstd.

It would be nice if the tool made it easy to embed the output of external commands. Lua and Python builtin support is nice, but probably not sufficient. I'd like to be able to run a jq command on a single column and merge it back as another column, for example.

Inspiration:

  - csvquote: https://news.ycombinator.com/item?id=31351393
  - teip: https://github.com/greymd/teip

I’ve often wondered why we don’t put a thin tiny proxy in front of psql (or augment pgbouncer, other proxies) to collect telemetry on all real world queries against the db. Being a middle man there would give you lots of visibility into every single query being made as well as how long it takes. I guess the modern integrated stats tools help.

I think the real problem is not indexing correctly but rather modeling your problem correctly. You can throw indexes at the problem but that is sometimes just a bandaid to a more integral issue which would be inventing a new schema, new access patterns etc.


Jeff's impact on Google simply cannot be overstated. I was privileged to sit near him and be invited to coffee on a regular basis when I was working on a skunkworks project back in 2009. I took advantage of that access to ask a number of questions, even though I was quite knowledgeable about Google history from following it on Slashdot since the late 90s.

In the very early days of google.com, Larry and Sergey and the earliest search engineers built crawl, index, and serving systems that were fairly good, but building a new index with fresh pages was challenging. Google had chosen to do the "large amounts of unreliable hardware" approach and so individual server failures were common and the indexing system jobs would die half-way through and had to be restarted from scratch. The entire system was documented in a README file telling you what commands to run and when to complete an index, and apparently, it was more or less impossible to make a new index at a critically important phase (around the time people were starting to use Google in favor of Altavista).

Jeff's good friend Sanjay had just joined, leaving DEC WRL where they were both compiler optimization engineers (before that, Jeff was a grad student at UW, working for Craig Chambers, who later led the Flume project that replaced mapreduce). Sanjay convinced Jeff that joining was a good idea and they worked together in the early days to build a new index using flaky hardware, applying techniques that Jeff and Sanjay had learned during their academic and DEC days.

In particular, one key step of indexing is building an inverted index. For every query term in your corpus, make a list of documents that contain that term. The document list is typically ordered by decreasing document frequency. Parts of this computation can be sped up by partitioning by document and then merging the partitions' results. If you have a fast enough filesystem that scales with many clients, you can greatly speed up indexing with a cluster of machines, but the merge step has to be extremely fast, resource efficient, and restartable (without too much work that has to be redone) in case of machine failure.

Jeff and Sanjay looked at the problem and realized it could be implemented using the already well-known functional paradigm, "map and reduce": the map function is called with a document argument, and it emits (term, document) pairs (one for every term in the document). The (term, document) pairs, after some processing, are passed to a reducer, which operates per-term, merging all the documents for that term with a reduce function. The result is a k/v table of term to documents for the entire corpus, which can be used to implement keyword search efficiently.
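
A toy version of that inverted-index mapreduce, with the shuffle collapsed into an in-memory dict (the real thing is distributed and disk-backed, of course; the sample documents are made up):

    from collections import defaultdict

    docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick dog"}

    # Map: emit (term, doc_id) for every term in every document.
    pairs = [(term, doc_id) for doc_id, text in docs.items()
             for term in set(text.split())]

    # Shuffle: group the pairs by term (this is the step that has to be
    # resource efficient and restartable at scale).
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)

    # Reduce: per term, merge into a sorted posting list.
    inverted_index = {term: sorted(ids) for term, ids in grouped.items()}
    print(inverted_index["quick"])  # -> [1, 3]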

The key innovation wasn't map, or reduce, but the shuffle step between the two. The mapreduce shuffler was a powerful and well-implemented system that was extremely resource efficient, parallelizable, and restartable without much loss. Those who are familiar with the early Google interview question """Let's say you have a computer with 2M RAM. This computer has a hard drive (with lots of free space) and a 100M file which you should sort. Let me know how you, as effectivly as possible, sort the file.""" (from 2007: https://tech.slashdot.org/comments.pl?sid=232757&cid=1892574...) will recognize that it is literally what Jeff and Sanjay worked on while implementing the shuffle: a merge sort that runs in very little RAM.
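
The textbook answer to that question, and roughly the shape of an external shuffle, is an external merge sort. A compressed sketch (the chunk size is arbitrary, and the input is assumed to be newline-terminated lines):

    import heapq
    import tempfile

    def external_sort(lines, chunk_size=100_000):
        # Sort an input that doesn't fit in RAM: sort fixed-size chunks in
        # memory, spill each sorted run to a temp file, then stream a k-way
        # merge over the runs (heapq.merge holds only one line per run).
        runs = []
        chunk = []
        for line in lines:
            chunk.append(line)
            if len(chunk) >= chunk_size:
                runs.append(_spill(sorted(chunk)))
                chunk = []
        if chunk:
            runs.append(_spill(sorted(chunk)))
        return heapq.merge(*runs)

    def _spill(sorted_chunk):
        f = tempfile.TemporaryFile("w+")
        f.writelines(sorted_chunk)
        f.seek(0)
        return f

    # for line in external_sort(open("big_file.txt")): ...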

MapReduce later turned out to be useful for a wide range of tasks. Beyond the initial indexing steps, it was used for PageRank calculation. While I normally think of PageRank as a giant eigenvector problem to be calculated in one shot on a single machine with tons of RAM, they solved it with an iterative mapreduce, which allowed the company to scale PR calculations. Even trivial things like frequency counting to provide ranking signals work well in mapreduce, but it's very much a hammer that makes everything look like a nail.

I could write more details about the early days, for example their contributions to Ads, generations of storage systems (from GFS to spanner and beyond), data encoding (protocol buffers, while not perfect, have been an excellent tool for data storage and exchange), file formats (recordio and sstable and leveldb). But to be honest it's just enough to say that they played absolutely essential roles in both the existence of google, and its unbelievable success. For example, tensorflow got its start as a little project Jeff wrote after working on a previous machine learning system, and I believe his original code is still in the examples dir (calculating the eigenvalues of a matrix).

Around the time i was getting invited to coffee with Jeff and Sanjay, I was also writing documents saying that Google should get involved in health care research (I believe in a simple premise: we are massively underutilizing biological scientific data in large-scale health care research, and few groups have the knowledge, skill, and impetus to prove that). This was pretty successful, as I know people read my doc and made projects based on my ideas, and one day, a chat window opened up. It was a group chat with Jeff and Sanjay! They had written a mapreduce to compress genomic sequences "if we get them small enough we can fit all the world's genomes on SSD!" (jeff is a performance nerd) and wanted my help seeing if it was competitive and useful (unfortunately, it wasn't). But I certainly learned a bunch of cool tricks, and variations on the original technology eventually were launched through Google Cloud Genomics. I will be forever thankful that I got to hang around engineers who were far smarter than me and learn from them, even if my path deviated and I never got to do the large-scale machine learning research I had long wanted to work on.

I will conclude by saying that while Jeff had INT 18, Sanjay had WIS 21, and it was the combination of their skills that was truly powerful. They pair programmed all day long and it was kind of crazy watching them communicate telepathically, then send out a changelist that optimized a single hot function that sped up most server binaries across the fleet by 1%.


I'd say the main arguments are:

1. Many transports that you might use to transmit a Protobuf already have their own length tracking, making a length prefix redundant. E.g. HTTP has Content-Length. Having two lengths feels wrong and forces you to decide what to do if they don't agree.

2. As others note, a length prefix makes it infeasible to serialize incrementally, since computing the serialized size requires most of the work of actually serializing it.

With that said, TBH the decision was probably not carefully considered, it just evolved that way and the protocol was in wide use in Google before anyone could really change their mind.

In practice, this did turn out to be a frequent source of confusion for users of the library, who often expected that the parser would just know where to stop parsing without them telling it explicitly. Especially when people used the functions that parse from an input stream of some sort, it surprised them that the parser would always consume the entire input rather than stopping at the end of the message. People would write two messages into a file and then find when they went to parse it, only one message would come out, with some weird combination of the data from the two inputs.
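
To make the failure mode concrete: if you do want several messages in one file or stream, you have to add your own framing, such as a length prefix per message. A minimal sketch (a plain 4-byte big-endian length here, not protobuf's own delimited helpers, with byte strings standing in for serialized messages):

    import io
    import struct

    def write_framed(stream, payload):
        # Prefix each serialized message with its length so the reader
        # knows where one message ends and the next begins.
        stream.write(struct.pack(">I", len(payload)))
        stream.write(payload)

    def read_framed(stream):
        while True:
            header = stream.read(4)
            if not header:
                return
            (length,) = struct.unpack(">I", header)
            yield stream.read(length)

    buf = io.BytesIO()
    write_framed(buf, b"first serialized message")
    write_framed(buf, b"second serialized message")
    buf.seek(0)
    print(list(read_framed(buf)))  # two distinct messages, not one blob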

Based on that experience, Cap'n Proto chose to go the other way, and define a message format that is explicitly self-delimiting, so the parser does in fact know where the message ends. I think this has proven to be the right choice in practice.

(I maintained protobuf for a while and created Cap'n Proto.)


I've been 29 for decades.

[WarpStream co-founder and CTO here]

1. Each WarpStream Agent flushes a file to S3 with all the data for every topic-partition it has received requests for in the last ~100ms or so. This means the S3 PUT operations costs scales with the number of Agents you run and the flushing interval, not the number of topic-partitions. We do not acknowledge Produce requests until data has been durably persisted in S3 and our cloud control plane.

2. We think people shouldn't have to choose between reliability and costs. WarpStream gives you the reliability and availability of running in three AZs but with the cost of one.

3. We have a custom metadata database running in our cloud control plane which handles ordering.
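
To make point 1 above concrete, here's a toy sketch of that batching idea (not WarpStream's actual code; `put_object` and the ack callbacks are hypothetical stand-ins, and records are assumed to be byte strings):

    import time
    import uuid

    class BatchingFlusher:
        # Toy version of the pattern: buffer produce requests from many
        # topic-partitions, then every ~100ms write one combined object to
        # object storage and only then acknowledge the producers.

        def __init__(self, put_object, flush_interval=0.1):
            self.put_object = put_object        # hypothetical S3-style PUT
            self.flush_interval = flush_interval
            self.buffer = []                    # (topic, partition, record, ack)

        def produce(self, topic, partition, record, ack):
            self.buffer.append((topic, partition, record, ack))

        def flush_loop(self):
            while True:
                time.sleep(self.flush_interval)
                if not self.buffer:
                    continue
                batch, self.buffer = self.buffer, []
                key = "batch-%s" % uuid.uuid4()
                body = b"".join(rec for _, _, rec, _ in batch)
                self.put_object(key, body)      # one PUT, regardless of partition count
                for _, _, _, ack in batch:
                    ack()                       # acknowledge only after the durable write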



This was (wait for it) interesting, but every item in "The Index of the Interesting" could be generalized as "Counterintuitive" or "Surprising." Regardless of the details, they all involve subverting expectations, assumptions or conventional wisdom. They're what Merlin Mann would call (checks username) "Turns Out."

As someone who is interested in what makes something interesting, I think the author has only part of the puzzle—what I would call Contrast.

I think Pattern is also inherently interesting. Imagine a researcher releases one paper every year applying their theory to different domains. Even if the papers themselves are not that interesting, the pattern is. Humans love to find patterns. But patterns can become tiring (think of an endless checkerboard pattern). If the researcher suddenly stops after 20 years, that's interesting—because breaking a pattern is Contrast.

Another inherently interesting thing is Relation. The transition between the bone and the spaceship in 2001 is interesting partially because of Contrast but also because of Relation—they're both images of "tools." One reason why it's exciting when a theory from one discipline is applied to a different discipline is that it suggests new relations, new links. "Canadian Tuxedo" is mostly funny/interesting because of the association it creates, versus the contrast between formalwear and denim.

Finally, there are certain subjects which are interesting to most people, but this is far more variable. In order, people tend to be most interested in: themselves, people they know, humans like them, humans very unlike them, humans in general, certain animals, and some cool plants. We prefer a photo of our own child over a photo of a stranger's kitten (and that photo of a stranger's kitten over a photo of lichen) for reasons that are baked into our DNA. And a theory about kittens is also going to be more inherently interesting than a theory about lichen to a lay person.


One thing about LSM trees that are implemented with large numbers of large files in a filesystem, such as RocksDB, is that they defer to the filesystem to deal with fragmentation and block lookup issues. That's not actually free.

LSM tree descriptions typically imply or say outright that each layer is laid out linearly, written sequentially, and read sequentally for merging. And that looking up a block within a layer is an O(1) operation, doing random access I/O to that location.

But really, the underlying filesystem is doing a lot of heavy lifting. It's maintaining the illusion of linear allocation by hiding how the large files are fragmented. That sequential writing is mostly sequential, but typically becomes more fragmented in the filesystem layer as the disk gets closer to full, and over time as various uses of the filesystem mean there are fewer large contiguous regions. More fragmented free space makes the allocation algorithms have to do more work, sometimes more I/O, just to allocate space for the LSM tree's "linear" writes.

Lookup of a block inside a layer requires the filesystem to do a lookup in its extent tree or, with older filesystems, walk through indirect blocks. Those are hidden from the LSM tree database, but are not without overhead.

Writing sequentially to a layer generally requires the filesystem to update its free space structures as well as its extent tree or indirect blocks.

Even a simple operation like the LSM tree database deleting a layer file it has finished with, is not necessarily simple and quick at the filesystem layer.

In other words, when analysing performance, filesystems are the unsung heroes underlying some LSM tree databases. Their algorithmic overhead is often not included in the big-O analysis of LSM tree algorithms running over them, but should be, and their behaviour changes as disk space shrinks and over time due to fragmentation.


<Troy McClure>Hi, I'm Kenton Varda. You may remember me as the creator of Cap'n Proto and the LAN-party-optimized house.

Back in 2007, I created "Jeff Dean Facts" as a Google-internal April Fool's joke. I wasn't funny enough to write any of the jokes myself, but I created the app that let people submit "facts", and was blown away by the results.

If you like primary sources, check out my write-up from when I first talked publicly about it in January 2012:

https://plus.google.com/u/0/118187272963262049674/posts/TSDh...

Don't miss the second comment, which has a long list of top-rated "facts".

