In 2015, they evaluated Apache Solr and Elasticsearch and decided to build their own (Firefly). They said other solutions did not scale. So, instead of contributing to scaling (like Apple, Bloomberg, and Cloudera did), they went the other way. Now they seem to be doing it again (though at least they are using Tika).
In the meantime, Solr has implemented most of the features they describe in their architecture document.
13) Search orchestrator seems to be a couple of features on top of Solr's existing routing linked earlier. There were individual approaches/3rd-party modules doing some of these (shadow, federation, ACL). Some of this is definitely unique to Dropbox though.
14) Precision vs Recall vs Ranking would take too many links, but there is a whole book on this: https://www.manning.com/books/relevant-search (mostly about Elasticsearch, but Solr has added some new features recently to make it even better)
And a lot more (the Solr Reference manual is more than 1300 pages...).
Obviously, this is a bit of a dig at Dropbox reinventing the wheel again (or perhaps this time actually using Lucene, but forgetting to attribute it so far).
But more importantly, it is a message to others who got excited by their architecture post. You can have a similar battle-tested system for yourself, for free. And if something is not perfect, you can fix it and help the rest of the world too. We are always happy to see new contributors.
Finally, if you know Apache Solr well, it is not just Dropbox you can work for, but also Lucidworks, Bloomberg, Cloudera, Alfresco, Shutterstock, Dice, CareerBuilder, and many others.
Why is that important? Is it advantageous versus the alternatives? (Genuinely curious)
I have been using GNU libextractor, but I see Tika quite often brought up in the same breath. When I tried Tika a while back, I didn't find it as good or as fast. Has that changed?
Tika is a very active project that Solr also uses. And they rely on other good libraries.
If libextractor is sufficient for you, that's great. If you hit its limitations, try Tika.
Some use-cases I know of include:
- Parsing Microsoft Office Files
- Doing OCR on images
- Running Tika as a standalone server with HTTP interface
Tika is most definitely a secret component inside a lot of systems that extract content/metadata from files. So, Dropbox leveraging Tika was a good move and worth recognizing, especially given that the rest of their choices do not quite make sense (based on the limited information provided).
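For the standalone-server use-case in the list above, here is a minimal Python sketch of what a client call might look like. It assumes a tika-server instance is already running locally on its default port (9998) and uses the `/tika` endpoint with a `PUT` request; the helper names `build_tika_request` and `extract_text` are made up for illustration and are not part of Tika.

```python
import urllib.request

# Assumption: tika-server is listening locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking Tika for the plain-text content of `data`."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str, url: str = TIKA_URL) -> str:
    """Send a local file to the Tika server and return the extracted text."""
    with open(path, "rb") as f:
        req = build_tika_request(f.read(), url)
    # This call needs a running server; it raises URLError otherwise.
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling `extract_text("report.docx")` (a hypothetical file name) would then return whatever text Tika manages to pull out of the document, which is what you would feed to an indexer.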
Entirely possible. Yet, this is what Apple presented in 2014:
Jessica Mallet from Apple, Inc. gave a presentation on how Apple uses SolrCloud. She briefly outlined some terms and concepts and then dug into how Apple built a multi-tenant search platform with each cluster holding around one million logical indexes. She also explained how their automation tool SolrLord uses alarms to trigger several events and can fix issues without any human interaction.
It's OK to not like Solr. At the same time, half of the features I listed above are quite new (SolrCloud, docValues, LTR, Config and Schema API, JSON support, etc).
And the other half - the 'old' part you may not like - are battle-tested, repeatedly speed-optimized pieces of code. Like this one: http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is... Moreover, their architecture makes it very clear that they are making very similar choices; it is just that their implementation is much fresher.
Sure, there is crud in Solr; it is an open-source product driven by user needs. Sure, it is possible that, for some use-cases, Java is a disadvantage.
I would have loved that refreshed comparison to be in the article. It is very jarring that it was not. As it is, it feels as though they walked away in 2015 and have not looked since, even though their "simpler" approach did not work out and they had to throw it away.
They only mention Tika and Kafka, both of which, I believe, are written in or use Java. I think the next article is supposed to give more details; I am looking forward to that.
I am very much looking forward to forthcoming posts describing the actual architecture and specifics -- this is a great high-level overview, but I hope and expect they will expand on this expose soon.
Cool to hear about a revamp. It confuses me why some projects reuse a well-known open source name. On Linux, the primary desktop environment (Gnome) uses Nautilus as its file manager. Dropbox even has a package for Dropbox/Nautilus integration.
What alternatives to Dropbox do Linux users have, other than running their own server (which is, for a number of reasons, suboptimal for most people)?
You can't really use OneDrive365, whereas Dropbox offers broad Linux support, is easy to set up, and can be used for free too.
Is there a reason why Linux users wouldn't use it? Asking out of curiosity.
ext4 on encrypted devices, such as dm-crypt/LUKS, is supported. What is not supported are encryption filesystems that are 'filesystem overlays', such as ecryptfs.
(Since I was using ZFS, I am still debating whether to stay with Dropbox after November's filesystem apocalypse.)
FYI, you can still use ecryptfs with Dropbox. Put the encrypted store inside your Dropbox folder, and mount it outside your Dropbox folder. From Dropbox's point of view, you have thousands of files with gibberish as names.
Better go with Nextcloud than ownCloud. Nextcloud is the fork by the original developer team, and has quite a few nice improvements compared to ownCloud (e.g., video and text chat, e2e encryption).
I use SpiderOak One [1], which is a privacy-focused alternative to Dropbox. I run it on Ubuntu (and previously on Debian and Arch Linux). There's no free tier like there is with Dropbox, though.
For a Linux user, you can already build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem. From Windows or Mac, this FTP account could be accessed through built-in software.
kbfs is a network filesystem, whereas Dropbox provides file synchronization. kbfs does not work when you are offline, whereas with Dropbox the files are always available locally on your machine.
It's funny because the only reason I know of Nautilus on Ubuntu is from installing Dropbox. To use the Dropbox daemon on Ubuntu you have to install it and then restart Nautilus.
That's why open source projects need to register their trademarks. That's how Gnome managed to stop Groupon from ripping off the name for their own project.