Machine learning saves us $1.7M a year on document previews (dropbox.tech)
201 points by wsuen on Jan 27, 2021 | 52 comments



Unfortunately it won't save me from unsubscribing because of the ever-increasing upsell tactics and dialog prompts in the Dropbox application. I just want simple file storage :(


I just did the same. I moved to Google, even though I’m no fan. Cost is about 70% less.


They won't do that, because simple file storage always gets "abused" and has to be shut down. There always have to be stressors that limit use cases to casual ones.


Abused how? By using their paid quota?

I know "unlimited" options are abused, but Dropbox isn't unlimited?


In whatever ways are incompatible with the business model, not necessarily with hostile/malicious/abusive intent. Maybe "abused" can be replaced with "used for procedurally generated data" or something.


If I rent 2 TB and use 2 TB, it doesn't matter what I use it for, does it?

I would agree if we were talking about Backblaze with "unlimited" backup for next to nothing, but this is Dropbox, which is rather pricey and also limited to a very specific number of (tera)bytes.

If they cannot deliver on their promises, shame on them.


The service offered is something of a mouthful like "document cloud storage for sharing over the Internet, billed for up to 2TB of space", not 2TB of raw disk space.

And the backend is always S3+Glacier, which costs about $25/TB/month for storage and $100/TB for downloads. So not only does filling your quota to the brim cost them more than twice what Dropbox charges for a 2TB plan, but overwriting and retrieving the files will seriously hurt their budget. Not your budget, and none of your business, but that's how it works, that's why everyone else went bankrupt, and that's why they don't try to be nice to advanced users.
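
A quick back-of-the-envelope under those assumed rates (the per-TB prices above are this comment's assumptions, not published Dropbox or AWS quotes):

  # Ballpark under the assumed rates above; every input is a guess.
  storage_rate = 25.0    # $/TB/month, assumed S3+Glacier storage cost
  quota_tb = 2.0         # the 2TB plan
  plan_price = 12.0      # $/month, roughly what a 2TB plan retails for
  monthly_cost = storage_rate * quota_tb  # $50/month to hold a full quota
  print(monthly_cost / plan_price)        # ~4x revenue, before any downloads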


Last I heard, they aggressively deduplicate across customers, and shared folders count towards the quotas of both sides; at least that is how it looked to me.

Edit: But most importantly: if you sell me "x units of anything", it is not abuse if anyone uses whatever they thought they bought.

I'd be willing to discuss it if a highly technical user keeps "thrashing" it by uploading frequently changed and unique data for no good reason (like thrashing a cache), etc., but ordinary users filling their quota to the max? No way.


mega.nz filled that gap for me.

Gloriously boring, no flashy corporate identity. I give them money and they give me file storage/sync. One or two bugs and questionable UI choices, but it just works, and I don't see them doing a pointless chain of redesigns or ads.


I am sick of Dropbox advertising to me and am going to cancel my subscription. I wish there were a "leave me alone" option for paid accounts; I was perfectly happy in the days before Dropbox decided it was necessary to remind me of its existence every few days.


I have a paid account, and use it a _lot_, but I'm planning to unsubscribe as well.

I'm tired of constantly having to re-learn where the upsell buttons will appear, so that I can avoid them when I actually use the product that I paid for.


Hi folks, author here. I am very excited to share this post about how we use machine learning at Dropbox to optimize our document preview generation process. Enjoy!

I'll be online for the next hour or so; happy to answer any questions about ML at Dropbox.


I work in an innovation-oriented ML lab, and we have a hard time identifying use cases with real added value.

So I wondered: who had the initiative to use ML in Riviera, the Riviera team or the ML team? How do you collaborate between the two teams/worlds (production team and data science team)?


Hi setib, great question. The original idea to use heuristics for preview cost reduction came out of a Hack Week project. This led to an initial brainstorm meeting between the ML team and the Previews team about what this might look like as a full-fledged ML product.

From the beginning the ML team's focus was on providing measurable impact to our Previews stakeholders. One thing that helped us collaborate effectively was being transparent about the process and unknowns of ML (which are different from the constraints of non-ML software engineering). We openly shared our process and results, including experimental outcomes that did not work as well as planned and that we did not roll into production. We also worked closely with Previews to define rollout, monitoring, and maintenance processes that would reduce ops load on their side and provide clear escalation paths should something unexpected happen. Consistent and clear communication helps build trust.

On their side, the Previews team has been an amazing ML partner, and it was a joy to work with them.


I'm curious to know the answer to this question as well. I have done a fair bit of work with organizations to identify ML use cases. When we looked at it from a business-process perspective, honestly it didn't go very well. Trying to find company-specific process interventions, especially in the format of building a funnel to prioritize which to move forward, rarely surfaces unique or game-changing ideas. We usually ended up generating a list of things where either ML played a minimal role, something simpler would have been better, or you'd need AGI.

What I've seen work better is a product approach, where ML is incorporated as a feature (rarely, but possibly, the centerpiece) of a full solution for an industry, one that provides a new way of doing something and the value that comes with it. The caveat is that this is hard and takes the up-front R&D and product-market-fit research that any product would. It doesn't happen in a series of workshops with representatives from the business.

This Dropbox story is an obvious counterexample, and it really looks like the mythical "low-hanging fruit" that we always want to identify in ideation workshops. But I'd be careful trying to generalize a process for identifying ML use cases from it.


I’ve worked on ML across several large e-commerce firms, and two patterns I have seen along the lines of your comment:

1. Many organizations dismiss ML solutions without actually trying them. Rather, if one hack-week-style prototype doesn't work on the first try, it's chalked up to "over-hyped AI" and never sees the light of day. Organizations that succeed with ML don't do it that way. Instead they ensure the success criteria are stated up front and measured throughout, so you can see why it didn't work and iterate for v2 and v3. "We spent a bunch of R&D expense on magic bullet v0 and it didn't succeed immediately" is a leadership and culture problem - you probably can't succeed with ML until you fix that.

2. Many companies have no idea how to staff and support ML teams, and go through cycles of either taking statistical researchers and bogging them down with devops, or taking pure backend engineers and letting them do unprofessional hackery with no ML expert in the loop to keep product quality honest.

You need a foundation of ML operations / infra support that cleanly separates the responsibilities of devops away from the responsibilities of model research, and you must invest in clear data platform tools that facilitate getting data to these teams.

If an org just figures they can throw an ML team, sink or swim, into an existing devops environment, or require an ML team to sort out their own data access, it's setting ML up for disaster - and again you'll get a lot of cynics rushing to say it's failing because ML is just hype, when actually it's failing due to poor leadership, poor team structure, and poor resourcing.


Yeah, for one thing, the scale of Dropbox may make this a uniquely worthwhile investment for them. Many apps have similar kinds of speculative caching features that could be optimized with predictive modeling, but the same cost-benefit analysis might show that it saves $1.7k/year instead of $1.7M, less than the cost of the feature's development and maintenance.


@andy99, @setib: we're a boutique that helps large organizations in different sectors and industries with machine learning: energy, banking, telcos, retail, transportation, etc. These organizations have different maturity levels, and their functions expect different deliverables.

The organizations range in maturity from "We want to use AI, can you help us?" to "We have an internal machine learning and data science team that's overbooked, can you help?" to "We have an internal team, but you worked on [domain] for one of your projects and we'd like your expertise".

As for expectations, you can deal with an extremely technical team that tells you: I want something that spits out JSON. I'll send your service this payload and I expect this payload back. So that's a tiny part.

Sometimes you have to build everything: data acquisition, developing and training models, a web application for their domain experts with all the bells and whistles, admin, roles, data management, etc. I wrote about some of the problems we hit here [0].

The point is that finding these problems is an effort that requires a certain skill/process and goodwill from the clients. We have worked on a variety of problems.

- [0]: https://news.ycombinator.com/item?id=25871632


I've gotta run, I'll take a look later if other questions come in!


Haven't read the article yet. But is it just me, or does that seem tiny, considering how large Dropbox is?


It does seem tiny, and my first thought was "how many dollars did they burn to save those $1.7M?", but that was one of the first things they evaluated, and both the research phase and the operational burden of running the service seem to be relatively small, so the investment definitely paid off. It's great that they're talking real numbers; loved the post in general!


Is the cost saving really measuring the right thing? Instead of comparing against the cost of pre-generating and caching every file preview, shouldn't they be comparing against the cost of adding enough infrastructure (or just optimising their preview code) to make on-the-fly preview generation acceptably fast?


Hi joosters, thanks for the question. It is always a good practice to ask if we are solving the right problem.

The decision to prewarm is ultimately a product decision to give users a better experience across the many surfaces where they encounter file previews. There are limits to how fast previews can be generated on the fly, even under optimal conditions (unlimited top-of-the-line hardware, a maximum of one request at a time). For instance, optimal on-the-fly preview generation for more expensive files (say, an hour-long video or a gigapixel image) can add tens of seconds to time-to-interactive -- not good from a user perspective!

Given this constraint, we wanted to optimize where we spend on the preview generation infrastructure without negatively impacting the user experience. We chose to do this with a combination of heuristics and the ML solution described in the article.
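
To make the gating idea concrete, here is a minimal sketch; the feature names, the sklearn-style model API, and the 0.5 threshold are all illustrative, not our actual pipeline:

  # Illustrative only: gate prewarming on a predicted view probability.
  def should_prewarm(meta: dict, model) -> bool:
      features = [
          meta["file_type_id"],          # hypothetical feature names
          meta["days_since_modified"],
          meta["owner_activity_score"],
      ]
      # sklearn-style API assumed: probability that the preview gets viewed
      p_viewed = model.predict_proba([features])[0][1]
      return p_viewed >= 0.5             # threshold chosen for illustration

  # At upload time, roughly:
  # if should_prewarm(meta, model): enqueue_preview_generation(meta)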


Couldn't you just load the first frame of the video? And aren't most image formats optimized for fast thumbnailing? I take your point that there are times when it will be slower, but are you saying that you assumed ML would be faster than generating on the fly, or did you actually check?


> Couldn't you just load the first frame of the video

No, often it's black. You have to do scene detection if you want meaningful previews/thumbnails; you can't just grab a frame at random.

And the original files are stored on HDDs, not SSDs/RAM.

Source: I don't work there.
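
That said, you can do better than frame 0 without full scene detection: ffmpeg's stock thumbnail filter scores batches of frames and keeps a representative one. A minimal sketch, assuming ffmpeg is on PATH (paths are made up):

  # Pick a representative frame instead of frame 0 (which is often black).
  # ffmpeg's "thumbnail" filter selects the most representative frame
  # from each batch of consecutive frames.
  import subprocess

  def extract_thumbnail(video_path: str, out_path: str) -> None:
      subprocess.run(
          ["ffmpeg", "-y", "-i", video_path,
           "-vf", "thumbnail", "-frames:v", "1", out_path],
          check=True,
      )

  extract_thumbnail("talk.mp4", "talk_thumb.jpg")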


Hmm... how does it compare to a heuristic that says to only generate previews for recently uploaded/changed files belonging to active users who may share a lot?
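
Something like this, say (every threshold below is made up for illustration):

  # Illustrative heuristic, not Dropbox's: prewarm only for recently
  # changed files owned by recently active users who share things.
  from datetime import datetime, timedelta, timezone

  def heuristic_prewarm(file_modified_at, user_last_active_at, share_count):
      now = datetime.now(timezone.utc)
      return (
          now - file_modified_at <= timedelta(days=7)          # fresh file
          and now - user_last_active_at <= timedelta(days=30)  # active user
          and share_count >= 1                                 # user shares
      )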


Dropbox are currently laying off 315 employees, but I imagine any process improvement that can save 5-10 full-time-employee equivalents is more than welcome.


What kind of "signals" are being fed into this model? Are we talking like, scroll position relative to a file, and things like that?


A similar problem was previously written up for Google Drive (which files to suggest to the user and possibly prefetch):

Quick Access: Building a Smart Experience for Google Drive (2017) https://research.google/pubs/pub46184/

and a recent follow up

Improving Recommendation Quality in Google Drive: https://storage.googleapis.com/pub-tools-public-publication-...


This doesn't seem like enough savings to justify paying a team to maintain it.


> We used the “percentage-rejected” metric minus the false negatives to ballpark the $1.7 million total annual savings.

I think this may be too sanguine about the false negatives, in that it ignores latency sensitivity. Generally, batch processing (like preview generation during pre-warming) is cheaper than latency-sensitive processing (like preview generation while the user is waiting for it). If you don't take that into account, you can be misled by your cost metrics.


Hi there. Fortunately for the ML team, the Previews team kept a detailed cost breakdown for different preview types, including on-the-fly generation cost, async generation cost, and cost to serve a preview. Our ballpark accounted for some of the varying costs, though the difference is not significant.
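
To sketch the shape of that ballpark: savings come from skipped prewarms, minus a penalty for false negatives that fall back to pricier on-the-fly generation. All numbers and names below are illustrative, not our real cost data:

  # Illustrative only: the structure of the savings estimate.
  def estimate_annual_savings(prewarm_budget, pct_rejected,
                              pct_false_negative, on_the_fly_premium=1.5):
      saved = prewarm_budget * pct_rejected             # prewarms we skip
      # False negatives still get generated, later and at a premium.
      penalty = prewarm_budget * pct_false_negative * on_the_fly_premium
      return saved - penalty

  print(estimate_annual_savings(10_000_000, 0.25, 0.05))  # 1750000.0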


The technical details are interesting, but the emphasis on the "$1.7M savings" screams misdirection of resources, considering the salaries of SWEs/ML engineers and, more importantly, the opportunity cost of deploying them on an optimization task.


It does sound like it would barely have broken even once you consider the opportunity cost of the highly compensated developers who had to write it, which the article ignores. It goes against the "rules of thumb" I learned at Google, which suggest that an engineer breaks even by saving [redacted but huge number] CPUs per year, and should only choose problems that promise to save 10x that or better.


Hi mehrdada. In the article, we discuss how to evaluate tradeoffs of ML projects. One of these tradeoffs is cost of development and deployment vs. cost of not developing a solution. In our particular case, the tradeoff made sense.


We're looking at this through a very narrow lens. A few of these wins a year, and a small team ends up paying for itself fairly quickly.


Let's say the engineers working on this make $200k a year. Even if it took 8 people working a full year to complete, that's roughly a 5x return on your money over 5 years. Sounds like a solid business case to me.
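
Spelled out (all inputs are this thread's guesses, not Dropbox's actual figures):

  # Worked version of the arithmetic above; every input is a guess.
  engineers, salary = 8, 200_000
  build_cost = engineers * salary          # $1.6M for one year of work
  annual_savings = 1_700_000
  print(annual_savings * 5 / build_cost)   # ~5.3x return over five years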


You also have to keep in mind that this work was interesting, and it probably helped keep a few rockstar engineers from leaving.


Agreed. At this level of competition for top talent you need to pamper your staff, which includes letting them work on the kinds of projects they'll proudly mention on their CVs, even if it's not the highest-priority project business-wise. I am sure a company like Dropbox can afford to do this once in a while.


That makes sense if all engineers are fungible and you are never limited by engineering capacity.


You also need to subtract the value those 8 people could create if they worked on something else.


You don't. If you did, you would be double-counting the cost of the engineers (if they were working on something else, you wouldn't account their salary against this optimisation).


This is unknowable.


Was totally going to point this out as well. It's entirely possible that their ML team cost nearly this much to begin with.


Setting aside the Dropbox-proprietary services, I'd be most interested to see which features/categories you picked from the file metadata, how those were prepped or rotated for training, which training/prediction algorithms you tried, and how you picked a winner. Also curious about the note that observing the "reason" was important here -- isn't this a case where a total black box would be sufficient?


FYI there's also the off-the-shelf https://filepreviews.io for this


Cool product, but the blog hasn't seen a post since 2017. Is it active?


It's a side project of active OSS hero jpadilla: https://github.com/jpadilla

The service is active, but he doesn't do marketing for it. He did talk about it on a podcast a couple of months ago: https://jpadilla.com/2020/01/10/podcast-djangochat-ep-45/


I have some questions.

How much will the false negatives cost in customer satisfaction?

How many FTEs will you need to maintain and improve this?

How many more customers did you get from the PR?


The more interesting problem is learning doc -> image directly. With Dropbox's scale of data, it seems feasible for some data types.


I'm sure that couldn't leak anything important.


OK, not related: I wonder how much paper we could save by showing a print preview by default. 1.7M trees? Profit for humanity...



