Architecture of Nautilus, the new Dropbox search engine

Tetris1 · on Sept 28, 2018

I remember when Dropbox released Firefly. It was simple and elegant. Nautilus is a monster in comparison. It would be nice to see some pieces of code...

wiradikusuma · on Sept 28, 2018

Also worth mentioning: http://vespa.ai/

arafalov · on Sept 28, 2018

In 2015, they evaluated Apache Solr and Elasticsearch and decided to build their own (Firefly). They said other solutions did not scale. So, instead of contributing to scaling (like Apple and Bloomberg and Cloudera did), they went the other way. Now, they seem to be doing it again (at least they are using Tika).

In the meantime, Solr implemented most of the features they are describing in their architecture document.

Specifically:

1) General scaling: https://lucene.apache.org/solr/guide/7_5/introduction-to-sca... (using ZooKeeper and SolrCloud)

2) Search Ranking and click-data training: https://lucene.apache.org/solr/guide/7_5/learning-to-rank.ht... (Contributed by Bloomberg)

3) Offline builds with substitution into production: https://lucene.apache.org/solr/guide/7_5/collections-api.htm...

4) Near-Real-Time: https://lucene.apache.org/solr/guide/7_5/near-real-time-sear...

5) Sharding specifically: https://lucene.apache.org/solr/guide/7_5/shards-and-indexing...

6) Extraction pipeline: they are doing it all together. We have:

a) pre-Solr extraction (usually done in a stand-alone client, though we do include Tika and DataImportHandler for quick start),

b) in-Solr pre-schema processing with Update Request Processors https://lucene.apache.org/solr/guide/7_5/update-request-proc...

c) Actual per-field text processing pipelines, separate for index and query (they later call the query part "query understanding"): https://lucene.apache.org/solr/guide/7_5/understanding-analy... Also, my own: http://www.solr-start.com/info/analyzers/

7) Pluggable internal index formats? Here is the latest (FST50): https://lucene.apache.org/solr/guide/7_5/the-tagger-handler....

8) Update system configuration live, over API? https://lucene.apache.org/solr/guide/7_5/configuration-apis.... https://lucene.apache.org/solr/guide/7_5/schema-api.html

9) Tolerate small failures, but abort if something is definitely not right: http://www.solr-start.com/javadoc/solr-lucene/org/apache/sol...

10) Retrieval root: https://lucene.apache.org/solr/guide/7_5/solrcloud-query-rou...

11) Retrieval leaf: That's Solr's basic shard/core

12) The inverted and forward indexes look like standard Lucene index and maybe docValues: https://lucene.apache.org/solr/guide/7_5/docvalues.html

13) Search orchestrator seems to be a couple of features on top of Solr's existing routing linked earlier. There were individual approaches/3rd-party modules doing some of these (shadow, federation, ACL). Some of this is definitely unique to Dropbox though.

14) Precision vs Recall vs Ranking is too many links, but there is a whole book on this: https://www.manning.com/books/relevant-search (mostly about Elasticsearch, but Solr has added some new features recently to make it even better)

15) BM25, we had it back in 2015: https://opensourceconnections.com/blog/2015/10/16/bm25-the-n...

16) (Future for Nautilus): Distance Based embeddings, such as Word2Vec. Commercial offering on top of Solr has it: https://lucidworks.com/2016/11/16/word2vec-fusion-nlp-search... but I remember discussion for Solr as well

17) (Future for Nautilus): Searching images/videos/etc: https://lucidworks.com/2015/08/28/shutterstock-searches-35-m...

And a lot more (Solr Reference manual is more than 1300 pages....).
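For anyone who has not met items 12 and 15 before, here is a toy Python sketch of an inverted index scored with BM25. It is not Solr's, Lucene's, or Dropbox's code. The `TinyIndex` class and the whitespace tokenizer are made up for illustration, and k1=1.2, b=0.75 are simply commonly used defaults. The IDF is the non-negative variant ln(1 + (N - n + 0.5) / (n + 0.5)).

```python
import math
from collections import Counter, defaultdict


class TinyIndex:
    """Toy inverted index with BM25 ranking (illustrative only)."""

    def __init__(self, k1=1.2, b=0.75):
        self.k1 = k1
        self.b = b
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_len = {}                  # doc_id -> number of tokens

    @staticmethod
    def tokenize(text):
        # Assumption: lowercase + whitespace split stands in for a real analyzer.
        return text.lower().split()

    def add(self, doc_id, text):
        tokens = self.tokenize(text)
        self.doc_len[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            self.postings[term][doc_id] = tf

    def search(self, query):
        n_docs = len(self.doc_len)
        if n_docs == 0:
            return []
        avg_len = sum(self.doc_len.values()) / n_docs
        scores = defaultdict(float)
        for term in set(self.tokenize(query)):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log(1 + (n_docs - len(docs) + 0.5) / (len(docs) + 0.5))
            for doc_id, tf in docs.items():
                # Length normalization: long documents are penalized via b.
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / avg_len
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        # Best match first; documents matching no query term are omitted.
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)


index = TinyIndex()
index.add("a", "dropbox search engine architecture")
index.add("b", "solr search search ranking")
index.add("c", "file manager for gnome")
results = index.search("search ranking")
```

Here `results` ranks document "b" above "a" and leaves out "c", which matches neither term. A real engine adds analyzers, compressed postings, and sharding on top of this basic idea.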

Obviously, this is a bit of a dig at Dropbox reinventing the wheel again (or perhaps this time actually using Lucene, but forgetting to attribute it so far).

But more importantly, it is a message to others that got excited by their architecture post. You can have a similar battle-tested system for yourself, for free. And if something is not perfect, you can fix it and help the rest of the world too. We are always happy to see new contributors.

Finally, if you know Apache Solr well, it is not just Dropbox you can work for, but also Lucidworks, Bloomberg, Cloudera, Alfresco, Shutterstock, Dice, CareerBuilder, and many others.

decasteve · on Sept 28, 2018

> at least they are using Tika

Why is that important? Is it advantageous versus the alternatives? (Genuinely curious)

I have been using GNU libextractor but I see Tika quite often brought up in the same breath. When I tried Tika a while back I didn't find it as good or as fast. Has that changed?

arafalov · on Sept 28, 2018

Tika is a very active project that Solr also uses. And they rely on other good libraries.

If libextractor is sufficient for you, that's great. If you hit its limitations, try Tika.

Some use-cases I know of include

- Parsing Microsoft Office Files

- Doing OCR on images

- Running Tika as a standalone server with HTTP interface
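The last use case above can be sketched in Python roughly like this. It assumes a Tika server is already running locally on its default port (9998) and uses its `PUT /tika` endpoint with `Accept: text/plain`. The helper names `build_tika_request` and `extract_text` are made up for this example, so check the endpoint and headers against the Tika server documentation for your version.

```python
import urllib.request

# Assumption: a Tika server is running locally on its default port, 9998.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data, url=TIKA_URL):
    """Build a PUT request asking Tika for the plain text of `data` (bytes)."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={
            "Accept": "text/plain",
            # Generic type, so that the server detects the real format itself.
            "Content-Type": "application/octet-stream",
        },
    )


def extract_text(path, url=TIKA_URL):
    """Send the file at `path` to the Tika server and return the extracted text."""
    with open(path, "rb") as handle:
        request = build_tika_request(handle.read(), url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

With a server up, `extract_text("report.docx")` (a hypothetical file name) should return the document's text. Keeping extraction behind HTTP also isolates the indexing client from parser crashes on odd files.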

Tika is most definitely a secret component inside a lot of systems that extract content/metadata from files. So, Dropbox leveraging Tika was a good move and worth recognizing. Especially given that the rest of their choices do not quite make sense (based on the limited information provided).

gunnarlieb · on Sept 30, 2018

About 17): we are using a powerful (commercial) image search plugin, which works well for us: https://pixolution.org/

innagadadavida · on Sept 29, 2018

> So, instead of contributing to scaling (like Apple and Bloomberg and Cloudera did),

Apple has 2-4 different search stacks. The web search one is fully homegrown and closed source.

arafalov · on Sept 29, 2018

Entirely possible. Yet, this is what Apple presented in 2014:

Jessica Mallet from Apple, Inc. gave a presentation on how Apple uses SolrCloud. She briefly outlined some terms and concepts and then dug into how Apple built a multi-tenant search platform with each cluster holding around one million logical indexes. She also explained how their automation tool SolrLord uses alarms to trigger several events and can fix issues without any human interaction.

https://www.youtube.com/watch?v=_Erkln5WWLw

sagichmal · on Sept 28, 2018

It's OK to not like Solr, which is a large and extremely old codebase.

arafalov · on Sept 28, 2018

It's OK to not like Solr. At the same time, half of the features I listed above are quite new (SolrCloud, docValues, LTR, Config and Schema API, JSON support, etc).

And the other half - the 'old' part you may not like - are battle-tested, multiple-times speed-optimized pieces of code. Like this one: http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is... Moreover, their architecture makes it very clear that they are making very similar choices; it is just that their implementation is much fresher.

Sure, there is crud in Solr; it is an open-source product driven by user needs. Sure, it is possible that, for some use cases, Java is a disadvantage.

I would have loved that refreshed comparison to be in the article. It is very jarring that it was not. As it is, it felt that they walked away from 2015 and have not looked since. Even though their "simpler" approach did not work out and they had to throw it away.

innagadadavida · on Sept 29, 2018

Any idea what language they used to implement these in?

arafalov · on Sept 29, 2018

They only mention Tika and Kafka, both of which I believe are written in/using Java. I think the next article is supposed to give more details; I am looking forward to that.

markpapadakis · on Sept 28, 2018

I am very much looking forward to forthcoming posts describing the actual architecture and specifics -- this is a great high-level overview, but I hope and expect they will expand on this expose soon.

tegansnyder · on Sept 28, 2018

Is this based on Lucene in any way?

tomrod · on Sept 28, 2018

Cool to hear about a revamp. It confuses me why some projects use a well-known open source name. In Linux, its primary desktop environment (Gnome) uses Nautilus as a file manager. Dropbox even has a package for Dropbox/Nautilus integration.

Insanity · on Sept 28, 2018

Yeah whenever I hear Nautilus, the first thing that comes to mind is the file manager.

But I suppose most dropbox users are on mac/windows.

hackandtrip · on Sept 28, 2018

What alternatives to Dropbox do Linux users have, other than running their own server (which is, for a number of reasons, suboptimal for most people)? You can't really use OneDrive365, and Dropbox offers vast support to Linux, is easy to set up, and can be used for free too. Is there a reason why Linux users wouldn't use it? Asking out of curiosity.

groovybits · on Sept 28, 2018

> Dropbox offers vast support to Linux

Dropbox only supports unencrypted ext4 filesystems on Linux, so I would not use the phrase 'vast support'.

coldtea · on Sept 28, 2018

Because there are many popular desktop options for Linux that don't use ext4 and non-encrypted as their default?

yjftsjthsd-h · on Sept 28, 2018

RHEL family is XFS-centric, and quite a few distros (including Ubuntu) offer encryption in the default installer.

danieldk · on Sept 28, 2018

ext4 on encrypted devices, such as dm-crypt/LUKS, is supported. What is not supported are encryption filesystems that are 'filesystem overlays', such as ecryptfs.

(Since I was using ZFS, I am still debating whether to stay with Dropbox after November's filesystem apocalypse.)

module0000 · on Sept 28, 2018

FYI, you can still use ecryptfs with Dropbox. Put the encrypted store within your Dropbox, and mount it outside your Dropbox. From Dropbox's point of view, you have thousands of files with gibberish as names.

barrkel · on Sept 28, 2018

Every desktop Linux user could use Dropbox and still most users of Dropbox would probably be Windows and Mac.

dordoka · on Sept 28, 2018

OwnCloud supports Linux [1]. If you want a SaaS version, they have several hosting partners [2].

[1] https://owncloud.com/client/ [2] https://owncloud.org/hosting-partners/

pritambaral · on Sept 28, 2018

Better go with Nextcloud than ownCloud. Nextcloud is the fork by the original developer team, and has quite a few nice improvements compared to ownCloud (e.g.: video and text chat, e2e encryption)

dordoka · on Sept 28, 2018

Didn't know about the fork. Thanks for the suggestion, will check it out.

tombowditch · on Sept 28, 2018

> Is there a reason why Linux users wouldn't use it

https://www.theregister.co.uk/2018/08/14/dropbox_encrypted_l... ?

Ylodi · on Sept 28, 2018

Seafile (https://www.seafile.com/en/home/).

reacharavindh · on Sept 28, 2018

There is Syncthing which looks good on paper, but it lacks an iOS client. I'd love it if someone developed even a read-only iOS app for it.

cpburns2009 · on Sept 28, 2018

I use SpiderOak One [1] which is a privacy focused alternative to Dropbox. I run it on Ubuntu (and previously on Debian and Arch Linux). There's no free tier like there is with Dropbox though.

[1]: https://spideroak.com/one/

minikomi · on Sept 29, 2018

For a Linux user, you can already build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem. From Windows or Mac, this FTP account could be accessed through built-in software.

Insanity · on Sept 28, 2018

Yeah, but that is a different group.

"A large percentage of Linux users use Dropbox" does not equal "A large percentage of Dropbox users use Linux".

lyk · on Sept 28, 2018

keybase's kbfs is my dropbox-style replacement. It works really well in my experience

danieldk · on Sept 28, 2018

kbfs is a network filesystem, whereas Dropbox provides file synchronization. kbfs does not work when you are offline, whereas with Dropbox the files are always available locally on your machine.

mynewtb · on Sept 28, 2018

Seafile is great

mywacaday · on Sept 28, 2018

I'm mostly a windows user but have some exposure to Ubuntu and had never heard of Nautilus

phalangion · on Sept 28, 2018

It's funny because the only reason I know of Nautilus on Ubuntu is from installing Dropbox. To use the Dropbox daemon on Ubuntu you have to install it and then restart Nautilus.

Ylodi · on Sept 28, 2018

You never used a file manager? That was a tiny, little exposure. :-)

cpburns2009 · on Sept 28, 2018

I think Ubuntu renames Nautilus to "Files".

feborges · on Sept 29, 2018

GNOME does.

M_Bakhtiari · on Sept 28, 2018

That's why open source projects need to register their trademarks. That's how Gnome managed to stop Groupon from ripping off the name for their own project.

acct1771 · on Sept 28, 2018

Clear disconnect from the open source community, which isn't a negative signal, but also is not a positive one.

justtopost · on Sept 28, 2018

Yeah, it's enough that I am actively disinterested. Seems like the opposite of goodwill.

danShumway · on Sept 28, 2018

Yeah, this seems unnecessarily confusing to me. It's an internal project, so it's not like the name matters for advertising or anything.

Surely it's just common courtesy to not step on top of an actively developed, very popular project that is directly related to file management.

buboard · on Sept 28, 2018

Isn't this a proprietary backend program that users will never know about? Who cares?