How we spent $30k in Firebase in less than 72 hours (hackernoon.com)
307 points by slyall on Aug 1, 2018 | 237 comments



When you have an unexplained performance problem, your response shouldn't be to "upgrade every single framework and plugin" that you use. The 36 hours that they spent doing this cost them $21,600 on GCP and didn't solve their users' problem.

Understand the services you depend on. Track the number of requests you're making to them, how long they're taking, and how many are failing. Reason through your system and look at the data when you have issues, rather than grasping at straws.
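
For example, even without a full APM product, a thin wrapper around your HTTP client gets you most of that visibility. A minimal sketch in TypeScript (the wrapper name and the in-memory stats map are my own illustration, not anything from the article):

    // Count requests, latency, and failures per endpoint.
    type EndpointStats = { count: number; failures: number; totalMs: number };
    const statsByEndpoint = new Map<string, EndpointStats>();

    async function instrumentedFetch(url: string, init?: RequestInit): Promise<Response> {
      const key = new URL(url, location.origin).pathname;
      const stats = statsByEndpoint.get(key) ?? { count: 0, failures: 0, totalMs: 0 };
      const start = performance.now();
      try {
        const res = await fetch(url, init);
        if (!res.ok) stats.failures++;
        return res;
      } catch (err) {
        stats.failures++;
        throw err;
      } finally {
        stats.count++;
        stats.totalMs += performance.now() - start;
        statsByEndpoint.set(key, stats);
      }
    }

    // Dump a quick summary whenever things feel slow:
    function dumpStats(): void {
      for (const [endpoint, s] of statsByEndpoint) {
        console.log(`${endpoint}: ${s.count} calls, ${s.failures} failed, avg ${(s.totalMs / s.count).toFixed(1)} ms`);
      }
    }

Even something this crude would likely have surfaced "thousands of reads per page view" immediately.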


That jumped out at me too.

Obviously they were under a lot of pressure and it's easy to stand here and criticise, but...

...if my site is slowing down with load or usage, I'm not sure how you make the jump to "I should update my UI libraries!". Angular 4 isn't getting any slower, so the best case is that you've got some unknown performance bottleneck in your UI that is somehow causing 30-second page load times, that it just happens to be fixed in Angular 6, and that you don't accidentally add any new issues when you upgrade.

Conversely, it feels like if you're struggling with "slow load times" on a SPA, the first thing you'd do is open the network tab and see what requests are being made, to what, how often, and how long they're taking.

Grasping at straws does seem to be the right metaphor. (Or maybe the old chestnut about the drunk dropping his car keys in a dark parking lot, then looking for them under a streetlight, since it's too dark to find them in the parking lot?)

I'm happy for the team and it sounds like things are going great for them, but wow, that was an almost fatal bit of blindness. On the plus side, I bet everyone involved will check for inefficient database calls first next time. :)


That’s why you need good humans who are experts at responding extremely well when things break. It’s one thing to prep for a whiteboard interview; it’s another to intuit that needle in the haystack.


It's only about finding needles in haystacks if you don't have sufficient instrumentation and monitoring in place. For most production issues your tools should be able to guide you at least the first 75% of the way to your issue, even if they can't usually offer a good fix (though sometimes they can, such as when new relic points out missing db indices causing unnecessary full table scans).


I agree. It's also about trusting developers' intuition when debugging problems. We recently (in the past few working days) went through something similar, where we had a problem blocking us from a release and people scrambling to figure it out.

We had some software that was returning different results in different environments, and we couldn't figure out the problem. There was a lot of panic in the room: upgrading and downgrading Maven dependencies, building things inside and outside of Jenkins, and all sorts of random things.

We kept telling the project leadership that we were poking at the wrong part (intuitively), but they kept pushing. I had to explain how Maven works, how building on Jenkins doesn't differ from building from our IDEs, etc.

It was only when we asked for isolation from the (human) elements that we had the freedom to properly debug.

In the end, an unstable sort was the cause of the issue. We were taking the last element from an array, but not sorting the array first.
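
In other words, roughly this class of bug (a contrived TypeScript sketch, not our actual code):

    // Bug: "latest" depends on whatever order the results happened to arrive in.
    const buggyLatest = (events: { timestamp: number }[]) => events[events.length - 1];

    // Fix: establish the order explicitly before taking the last element.
    const latest = (events: { timestamp: number }[]) =>
      [...events].sort((a, b) => a.timestamp - b.timestamp)[events.length - 1];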

All of the stuff we did from last Thursday to Tuesday evening didn't help us.

So, I agree, you need good humans who are good at responding well when things break.


I've been in situations where the whole developer team was trying to solve a production bug, and the managers never tried to tell us where to focus.

That wouldn't make any sense at all.

Every developer has their own guesses that they need to explore and validate. Sometimes there's some clustering around one area of focus, but there's usually one person exploring a totally different area to find that bug.


On the other hand I've spent hours debugging an issue in a C++ program I work on only to eventually find that it was a bug in the oldish version of GCC we were using and simply upgrading the compiler would have fixed it.

It's entirely possible that they could have spent hours or days debugging their issue only to find it had already been fixed.


But after those hours you knew exactly what the problem was and exactly what the solution was. You were in a much better position and could make better judgement calls at that point. It's better to dive in and figure out a problem than to just randomly upgrade things.


This isn’t really the other hand. This is saying exactly what OP is. Debug your system and fix the bug when you find it. Don’t randomly change things hoping the problem goes away.


> Reason through your system

Modern stacks have a huge opacity problem: everyone wants to be magic, and everyone fails. Abstractions make reasoning harder. What tools and techniques would you suggest for doing this?

I'd probably run the application in some sort of sandbox and measure the outbound request load vs inbound request load, something a containerized deployment should be giving the end user (developer) as an affordance for application maintenance and visibility. Differential analysis and graphing built directly into the execution substrate.

Judge away!


Basic metrics around platform API calls would have done the trick, no need for fancy container solutions.

edit: I would have also assumed they would get this for free from GCP's billing breakdown, but I'm not familiar with it. My first intuition when facing unexpected billing would be to figure out what the major contributor to the bill is (in this case, massive reads from FireStore), not update my frontend packages.


Exactly. Not having proper tooling meant they didn’t know what to do. When your check engine light comes on, you don’t replace every system in your car, you get out a scanner and check the code.

Google cloud has trace built in which could have shown them execution times and is dead simple to drop into most frameworks.

The real story here is that they didn’t have engineering leadership on the team who knew how to properly diagnose issues, put tooling in place before launch, and understand how their system is architected.

Kudos to the engineers for solving this issue under pressure.


>"When your check engine light comes on, you don’t replace every system in your car, you get out a scanner and check the code."

^ Best response, hands down.


> My first intuition when facing unexpected billing would be to figure out what the major contributor to the bill is (in this case, massive reads from FireStore), not update my frontend packages.

They didn’t upgrade packages to solve the mystery billing. They upgraded packages before they checked what was going on with the database. When they saw the high billing, that pointed them to the problem and they fixed it.

There was some questionable judgement shown by not checking db requests first, sure, but in no way did someone think “our google billing is high, we better upgrade angular”.


> modern stacks have a huge opacity problem, everyone wants to be magic, and everyone fails.

Also, lots of people are impatient and/or intellectually lazy. We have piled up a ton of abstraction layers, yes, but they aren't hard to pry apart. But people want immediate results without doing necessary cognitive work - understanding-guided exploration.

It usually isn't hard to identify which component of your product is misbehaving. Before getting into the complex magic of containers and sandboxes and such, I'd start with the easiest things: looking at the Network tab, at your server's resource use, reading the logs, adding some log statements measuring times in suspected areas, actually profiling the backend code (e.g. with a statistical profiler). This should quickly help you identify where the problem is manifesting itself. Then the search for the cause begins.
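
For the "log statements measuring times" step, even a throwaway helper is usually enough to localize things (a sketch; the helper name and usage are made up):

    // Throwaway timing helper for bracketing suspected slow areas.
    async function timed<T>(label: string, fn: () => Promise<T>): Promise<T> {
      const start = performance.now();
      try {
        return await fn();
      } finally {
        console.log(`${label}: ${(performance.now() - start).toFixed(1)} ms`);
      }
    }

    // Usage: const payments = await timed('loadPayments', () => loadPayments(campaignId));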


There is also a common sense problem. Not trying to play the blame game, but the guy who computed a subtotal displayed on the front page by downloading every row in the database probably isn't a great programmer. It's easy to say "don't write bugs", but there are bugs and there are moronic design decisions. This, to me, doesn't feel like a real bug.


There's a good chance this is true, but it can happen even with competent developers. One person writes the unoptimized function figuring it can be improved later; someone else assumes it must be cheap and drops it on the front page. Or there was meant to be a caching layer that didn't work because of some bug.


>Also, lots of people are impatient and/or intellectually lazy. We have piled up a ton of abstraction layers

Sorry, the guy that wrote/implemented abstraction layers 2 and 3 left 2 years ago and didn't document anything.

We've been understaffed for a year and we've been told not to hire any more staff until the new financial year.

I've got a "technical debt" item on the backlog but business drives the priorities and it'll never get done.


> Sorry, the guy that wrote/implemented abstraction layers 2 and 3 left 2 years ago and didn't document anything.

That only matters when the fault is located or manifests itself precisely in that layer, and that's always a risk. Consider all the third-party dependencies you use. They usually have only their APIs documented. Fixing a fault usually requires knowledge of the internal implementation.

> We've been understaffed for a year and we've been told not to hire any more staff until the new financial year.

> I've got a "technical debt" item on the backlog but business drives the priorities and it'll never get done.

Yeah, I get that. I've seen that. Thing is, you can only play the lottery with your main product for so long - and also, if your workplace runs an assembly line so tight that you can't spend hours thinking per ticket (excluding the most trivial ones), then something is seriously broken on yet another level.

Ultimately, I guess what I'm saying is that the main problem here is cultural - possibly both on developer and management side. The actual technical tasks aren't usually that challenging.


Well a browser is basically a sandbox for a webapp.

To debug something complex, I would use chrome devtools, which can measure all kinds of metrics, and the function "Audit > LightHouse" automates the process and ranks a webapp in several key categories.

This case would appear to be related to network requests, so that issue should be fairly obvious in LightHouse.


> Understand the services you depend on. Track the number of requests you're making to them, how long they're taking, and how many are failing. Reason through your system and look at the data when you have issues, rather than grasping at straws.

This is a good thought-process even when debugging issues during development. I've seen many developers attempt to "fix" issues by trying to figure out what dance/keystroke makes things work.

Whenever you encounter an issue of any kind, anywhere, understand the issue before attempting to resolve it. It may require you to dig deep into things you don't currently understand, but your career is currently telling you that you need to understand it.


>When you have an unexplained performance problem, your response shouldn't be to "upgrade every single framework and plugin" that you use.

Yeah. That jumped out at me as well. They spent an inordinate amount of effort to solve a non-problem. It's great to stay on evergreen with versions, but probably not a good thing to do so while you're desperately trying to debug a problem.

I suspect this was a hopeful but lazy attempt - in the spirit of "Maybe if we just do this, it will somehow fix the underlying problem". It's a lazy approach to solving problems. Debugging performance bottlenecks is hard and devs generally hate doing it. Upgrading version dependencies is a known factor and developers are comfortable with that.


There have been MANY times when upgrading outdated libraries or software versions resolved similar performance issues for me. Should you just immediately run update-all? No. But acting like that's not a viable solution is silly.


>But acting like that's not a viable solution is silly.

Usually you want to understand the problem before solving it. In this case, they wasted a bunch of time doing a bunch of things (upgrading all the dependencies, and refactoring the app) in the hope that something (ANYTHING) they're doing hopefully fixes a problem they don't understand. Smart move?


Everyone can sit back in hindsight and act like armchair developers. When you're in the shit, sometimes you make not-so-smart moves. Now they've learned from their mistakes, which is why they're talking about it.


>Everyone can sit back in hindsight and act like armchair developers.

That's one perspective. But come on! I really don't understand the attitude that, when presented with a problem, the first approach is to spend a few days blindly refactoring code and upgrading all the underlying frameworks. Seriously?

The problem is also obvious if you just stop and think about it for a second:

- They are using Firebase. For the purposes of diagnosing our issue we can assume the backend will scale well (for trivial queries) and the pipe between server and client should be wide. Firebase could be the problem, but odds are Firebase didn't go down on you just as your go-live went ahead.

- Because they are using Firebase, their app is completely client-side.

You can go through the potential areas of concern:

1) UI has trouble rendering. That should largely be independent of the number of users. If this was only a UI issue you'd expect some users to have problems (maybe ones that created a large amount of artifacts) but not all users. Presumably before going live, the app worked well with their test datasets.

2) Some combination of UI, network, or data model. They noticed their web app got slower as the number of users grew. So the question is: why would an individual user session slow down as the total number of users grows? It must be that a single-user view is somehow dependent on the total number of users in the system. WHY?!? We know Firebase is fast, but any fast system can choke if you have a bad data model. So it could be a slow query. Or it could be too large of a response being sent down (again, why would a large response be sent down?). Maybe it was a huge JSON object and the UI locked up. Or something like that.

It really shouldn't have taken long to at least target potential areas to explore. HELL, you should be able to see the issue immediately if you open up the network tab. You'll see which requests are either taking forever, or leading to large amounts of data being transferred, or both.
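
You don't even have to eyeball the network tab by hand; the browser's Resource Timing API can list the worst offenders straight from the console (a sketch, with arbitrary thresholds):

    // List the slowest / heaviest requests the page has made so far.
    const entries = performance.getEntriesByType('resource') as PerformanceResourceTiming[];

    entries
      .filter((e) => e.duration > 1000 || e.transferSize > 500_000) // > 1 s or > ~500 KB
      .sort((a, b) => b.duration - a.duration)
      .forEach((e) =>
        console.log(`${e.name}: ${e.duration.toFixed(0)} ms, ${(e.transferSize / 1024).toFixed(0)} KB`)
      );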

It really isn't about 'armchair developers'. I've been in situations where things are falling apart and you need to figure shit out. Our product is on-prem and used in hospitals and is connected to multitudes of other systems controlled by other vendors. When you're trying to diagnose issues, you have to have a rational approach based on some reasonable hypothesis.


To be fair I recently upgraded an old PHP install from 5.5 to 7.2 and it sped up the site tremendously. My workflow while troubleshooting is now always going to involve upgrading software.


While upgrading should be an option, just don’t do it as a first step. Changing any dependency comes with risks and moving from version 5.5 to 7.2 of anything screams danger to me.

Sure, there are times when it’s going to work out for you but you should at least have narrowed down your issues before you go down that path.


I mean, it can be fairly safe. When I moved from 5 to 7.0 I simply left the PHP-FPM configuration for 5 around and 5 itself installed. It still is.

I can spin up PHP 5, 7.0, or 7.1 without delay in case anything goes wrong.


Yeah, I don't get it either. If you have performance problems, you look at the network tab and see what's up, instead of yolo-upgrading your framework with fingers crossed, hoping it works. Measure before you fix.


Yep, with performance like that it's highly likely to be a problem involving the persistent store of the application. The first instinct should have been to enable some tracing or logging to see where the time was being spent.


Scientific reasoning is a rare skill. I'm regularly met with bafflement and blank stares when I suggest that we measure something to determine performance, then perform the same workload, on the same hardware, after making a change.

I've literally had people try to draw conclusions from comparisons of test runs with completely different parameters, different data sets, different resources, different versions of code - absolutely everything varying.

My head just explodes... I want to scream that this isn't how this works; it's not how any of this works.


Unfortunately, I often see developers asking for complex diagnosis in urgent emergencies, when the reaction time frame is literally something like 30 minutes. So from a business point of view there has to be a quick-and-dirty, imperfect analysis first, because that's what emergency decisions need. Sometimes you can just extend the caching time as a quick fix, but sometimes there is no such option. Programmers like a nice, calm working environment, but business is often a very different place. So please keep in mind what the business goals are. And yes, I see there are a lot of stupid managers in IT, and basically you're right, we should use precise data. Just one point: when it's available...


The first thing I'm likely to say to my team when an outage occurs is something along the lines of "stop the bleeding". That might mean bypassing the affected service if possible, rolling back a recent release, or reducing the amount of traffic (we're fortunate enough to have most of our traffic coming from sources we can throw a kill switch on).

However we go about it, the first priority is to give ourselves some space to properly analyse the issue and find the real solution without the rest of the business worrying loudly about things being broken.


Managing those experiences takes effort, otherwise you're sure to end up with the issues you describe. Even when the data makes sense, sometimes people hand you an unparsable CSV with their results, while the conclusions come from a gut feeling. Sadly, not everybody understands standard tools like boxplots, and analyses are often hard to reproduce.

The tooling to make it easy isn't there yet.


I think you have a very valid point there, but I would argue it is also important not to react to such situations too strongly. This is something April Wensel talks about here [1].

[1] bit.ly/2v37AzE


I mean this in the nicest possible way: I see this all the time with the JavaScript set and I am absolutely not the least bit surprised. I used to work on a team that was TypeScript top to bottom, with people who didn’t really even understand how to debug (they were mostly bootcamp juniors). Whenever something would break, if restarting it didn’t work, you know what they’d try? Yup, upgrading random dependencies. Refactoring was also pretty popular, although it usually just ended up making things more complicated.

It’s silly to suggest that JavaScript itself is somehow responsible for this. It’s obviously just a tool. But I have to say, the most professional cluelessness I’ve ever encountered was in the JS ecosystem.


I concur. I’m having a hard time not coming off as a dick while trying to opine on how amateurish I think this is. I like JS, but the devs and ecosystem leave me wanting.


The sad thing is that this is in my experience the norm. Not just in frontend/js either. It's everywhere.


> the most professional cluelessness I’ve ever encountered was in the JS ecosystem.

A testament to how low the barrier of entry has gotten. It's both a good and bad thing at the same time.

It does, however, mean you have to be ever more vigilant about at least your first layer of dependencies in that ecosystem, if you do want to be professional.

The higher the barrier to entry of a language, the more likely it is that when you're pulling in dependencies, the code isn't amateurish.

Incidentally this is probably why JS juniors think to upgrade dependencies when they encounter unknown situations... a lot of problems in JS do come from your dependencies.


It's been a long time goal of the software dev community to lower barriers to entry through bootcamps and such, in an attempt to "democratize software".

And the result is unsurprisingly poorer quality software. So why is it a good thing?


It's a good thing because it avoids the problem of blocking people from entering for largely arbitrary reasons (i.e., country of birth; exposure to computers in childhood).

I don't think the problem of quality should be addressed by arbitrarily axing people from the field. Some sort of a standardization / accreditation seems like a better approach.


The country of birth and exposure to computers in childhood are not preventing people from becoming software engineers. I'm living proof of that, as are many people all around the world.

And axing is not arbitrary, it's generally done based on experience and know-how. Not everyone can or should become a software engineer.


Because people think bad software is better than no software


So I completely agree, but let me play devil's advocate.

If you do go and track down the problem in your dependency and file a bug, one of two things is likely to happen: they close it and say it's fixed in the latest version, or they refuse to accept your bug because it's filed against an old version.

Skipping the track it down part and just jumping into upgrading can be a time saver. It works fairly well if you fit into the 'common' part of the user base with frequent updates. (Incidentally dependencies with frequent updates are kind of a pain)


> If you do go and track down the problem in your dependency

This is the part the parent's co-workers didn't perform. There's nothing wrong with updating a dependency to include the fix for the problem you're experiencing. But the people in question were apparently too lazy/clueless to even track down the problem, opting for randomly upgrading stuff instead.


>If you do go and track down the problem in your dependency and file a bug

There's a difference between upgrading your dependencies because you traced a problem that you know is fixed in the newer version and upgrading your dependencies because you hope it fixes a problem you don't understand.


There is a possibility you found a bug and a workaround is known/suggested

There is the possibility you are told you made a mistake in thinking it’s a bug with the library


JS is currently more visible than other platforms, and the sheer size of the community plays a role too. Idiots are everywhere.

The current trend and curse of DRY and NIH is to solve stuff by adding dependencies and gluing them together. Rookies expect that some software has already solved the problem at hand, without thinking about it. Even worse, they apply this even to rather simple things. The problem in the OP - counting items inefficiently - is absurdly common. IMHO this is the heart of the problem: the new generation is highly uneducated in how to handle data.


Are you... in the JS ecosystem yourself?

I used to be in the Java ecosystem, the C# ecosystem, the PHP ecosystem... and I could have made the statement "the most professional cluelessness I’ve ever encountered was in the X ecosystem."

I think it's just an industry thing.


Having spent more than a decade working and hiring in the Java world, I'm inclined to agree.

Having spent much of the past year writing Rust and interacting with that community, I'm inclined to disagree.

My overall feeling is that it isn't only JavaScript, but it is JavaScript; languages that attract a higher caliber of developer don't suffer from this problem the way that languages that appeal to a wider demographic do.


I feel like JavaScript is a bit of a special case, because there are people out there who, 5-10 years ago, did HTML, CSS, and jQuery, and called themselves "front-end developers", and then along came Node, so they read a few tutorials because it was just JavaScript, and they got some stuff vaguely working and then rebranded themselves as "full-stack developers" - throw in Firebase and Heroku and you have people who can store data and deploy applications to the web with only the most superficial understanding of what they're doing (and yes, that's how you end up spending $30K accidentally on Firebase and thinking that upgrading Angular might fix the problem).


It has to do with the barrier to entry.

PHP, Ruby on Rails, and jQuery are other technologies that had a low barrier to entry and received the attention of the "unwashed masses".

This being 2018, JS has very low barrier to adoption (Have a web browser? You have a JS runtime.) and nature runs its course.


Oh come on. Tech does not need barriers to entry. It just needs debugging and improving one's code to also have a low barrier to entry.


Software engineering is very hard and can't be taught in bootcamps, blog posts or on the job. It takes years of concentrated effort and quality learning material.

Even before, a significant number of computer science graduates couldn't do software engineering after graduation.

That barrier to entry is designed to protect society from poor quality software and actual software engineers from having to suffer through picking up the broken pieces after those people that were helped to jump the barrier.


Where can it be taught, then? There are no resources for this right now, mostly people are left to their own devices.

Most barriers to entry are not designed to protect anyone, they're designed to preserve power. To protect people from bad products, you need regulation, accreditation, etc.


There are very good universities focusing on computer science and software engineering in most countries.

This is the first step. Then one needs to find a company with a good engineering culture, apply the theory they learned and gather experience. Ideally one should find a qualified engineer as mentor.

Self-study and being aware of developments in the profession are the last piece of the puzzle.

Yes, some people won't be able to do some of these things and as a result they won't be good software engineers. They could still be successful programmers; the two aren't necessarily related.


This seems more like what you would like to happen, not what actually happens to most people, so it doesn't really qualify as resources.


Universities ?


Most universities don't even cover something as basic as version control. Monitoring? Forget it.


Type with one hand, make your software better!


> Oh come on

Apologies if I suggested otherwise, but of course programming doesn't need barriers to entry. Just like PCs and the internet don't.


Something about jQuery does remind me of JavaScript


Most JS devs think debugging is just console.log(). That's nice and all, but it's not connecting a debugger, adding breakpoints, and watching the data go around. I've lost count of the number of times I've caught issues with just a simple game of following the calls and objects around with the debugger.

As soon as I got to the part where they just upgraded a bunch of libraries, I rolled my eyes. I was expecting a serious look at something, perhaps even a bug in Firebase or something in-depth. But nope, what we got was "Oops, I didn't think about the number of API/DB calls we were making, because we don't think that way; we just assume everything is the fault of the libraries we use."

That kind of attitude is why I cannot wait to abandon JS altogether.


> Refactoring was also pretty popular, although it usually just ended up making things more complicated.

Yeah, I'm guilty of this one. Sometimes you know the problem is somewhere in a particular area of code, but that code is all over the place. Pulling it apart and refactoring it can be a good way of understanding all its dependencies. If the refactoring doesn't help, just don't check it in.


I don’t think it’s necessarily something to be “guilty” of. Especially if it’s someone else’s code I’m debugging, I agree that rewriting it “in my own words” can be a great way to understand what’s going on. Like you said, it’s not like you have to check in the changes.

However, it sounds like people are talking about refactoring an app solely in the hope that the refactor shakes out whatever bugs exist. That sounds like the debugging equivalent of “8 hours of coding saved me 30 minutes of planning”.


s/Javascript/PHP/g # 7 years ago

s/Javascript/Visual Basic/g # 14 years ago

s/Javascript/Basic/g # 21 years ago

Us old/wise/thoughtful folk have denigrated the tools that young/foolish/impetuous kids use since we were they.

We need both: yes, these young people made some mistakes, but I'm in awe at what they achieved. They built, triaged, and fixed a massively successful campaign in the time it would have taken me to scope out the requirements. Oh, and glad-handed Google into paying the tab... impressive!

[Update: formatting]


This is why infinitely scaling pay-as-you-go cloud services terrify me.

I refuse to use a service like this unless it gives me the ability to automatically cap costs and alert me when thresholds are met.

All it takes is a rogue line of code in an endless loop or something, and you are bankrupt.

Their site seems pretty basic. I'm struggling to understand why they couldn't just run it with something like Postgres for less than $100 a month on AWS?


Product Manager for Cloud Firestore here. It's worth noting we do have the ability to set hard daily caps, as well as budgets that can have alerts tied to them. It's also something we're looking at ways to improve.


>Product Manager for Cloud Firestore here. It's worth noting we do have the ability to set hard daily caps, as well as budgets that can have alerts tied to them. It's also something we're looking at ways to improve.

Google Cloud user here. A warning: If you ever happen to get, say, frontpage on reddit or techcrunch or other big boost to publicity, your site could be down until the next billing cycle (i.e. 24 hours) and you will have no way to fix it.

This bit me hard one day with appengine and lost us a ton of converting traffic, even though we tried to get the limit increased within ten minutes of the spike (and well before our limit was hit).


Urgh, that's terrible! That definitely shouldn't be the case. I'd love for you to send me any support case you had so I can review.

Even if the front door of the system didn't help you, we definitely should have been able to get you into a good state much quicker. My profile has Twitter and my DMs are open (I can give you my email there too).

Doing my kids' dinner, so responses might be slightly delayed.


This is exactly why I can't trust Google services though. The services are rock solid, but far far too often do I see folks who have to reach out to a dev on Twitter or their friend at Google to get a simple billing error resolved correctly.

It's a problem with most cloud providers, but Google seems to be notorious for it.


In my experience none of those folks are paying for support.


From my experience Firebase are different. They're now a part of Google but support has been excellent anytime I needed it - both before and after they were acquired.


It's quite the phenomenon. Reaching out via social media to garner a company's attention when all other avenues have been exhausted... what a world we live in.

"Company ignoring you? Send out a tweet, that'll work!"

And it blows my mind that it actually does. It's very sad.


This is a pretty common train of thought, but it's not really a question of deciding between "my bill will explode!" and "my app will go down!"

You just really have to put some serious thought into what your daily limits should be, and add some reasonable alerting to detect surges. The tools are there and they're not terribly hard to use. It just tends to be an afterthought for most developers because this doesn't look like a customer feature.

Without these sorts of automated scaled services, the traditional behavior is "your app goes down". This is a big improvement!


No, if your app is failing in a way where a page view now costs $1.00 instead of $0.0001, for example, then the application should go down.


Your app shouldn't just fail if an API is unavailable. You should code things to fail gracefully. In the case of a traffic spike replacing the front end with an email capture form and a message saying "things are really popular right now, how about we remind you about this tomorrow (with a discount code!)" works well.

This is hard for some things, but your startup failing because you didn't want to do it is much harder in the end.


Most applications have at least one SPOF: the DB. Just because the DB is now the Firebase API doesn't change anything.

It would be very difficult to build products in a reasonable amount of time if everything has to be coded defensively. I can build my app quicker, and remain sane, if I assume that the DB will always be available, and just fail if the DB isn't there. Same for things like S3 (which I think had only 1 large scale failure in recent history), Redis, etc...

There are APIs which can be unavailable and you need to work around those. For me, these are mostly third party services that I don't control. But then again, I'm not building the next Netflix. I don't have enough engineers to build an app that works with a chaos monkey!

Not the best approach for all applications, but has been good enough for most projects I've worked on. Just my 2 cents.


Yeah, that would happen with the Google Maps API before. The 150k request limit was not enough for a front-page hit, and there was no rollover.


So basically we want to have our cake and eat it too.


This seems like an overly cynical/snarky response.

It's not an unreasonable request that for services which advertise the ability to scale up and down on demand, that the billing and billing limits should also be able to respond similarly.


>that the billing and billing limits should also be able to respond similarly

How so? With a pay-as-you-go system, firing off warnings and giving a projection of future costs (which is hard when startups tend to have spiky traffic) is about as good as you can do.

Edit: I should add that the common solution to controlling your billing in situations like this is having some overflow path built into your beta app ("Sorry, we're not taking new users at the moment" or the like).


The problem is that the user gets those alerts, tries to change the limit they set, and it doesn't work. That is definitely not as good as can be done.


Or have it be adjustable within a reasonable time frame?


A current happy Google Cloud customer here, who also received the Startup credits. When you have credits on your account, none of the alerts or budgets work (at least they didn't when I was using it). The only thing you could do was look at your figures daily and plan/scale based on the running total.


How did you get your Startup credits?


My company was part of a partner acceleration program, however you can apply to be part of the Google Cloud for Startups program here[1], without needing to be in an accelerator.

[1] https://cloud.google.com/developers/startups/


Thanks guys. Appreciated


Not OP, but when our startup was accepted into https://www.thefamily.co/ (which is basically a European YC), amongst other perks, we got a huge amount of AWS credits. I'm sure other incubators/accelerators do similar offers for GCE/AWS.


I would very much like to have something like Google Authenticator, but for billing. With the ability to set alarms on my phone (and my coworkers'), preferably with some smarts to detect short usage spikes.

In essence, setting and updating amount, rate, and velocity (speed of rate change) caps on the fly.

Then I can set whatever tight limit I want to, and not worry about burning through too much cash because of some simple coding or config error.

A bug almost cost us several tens of thousands of dollars in BigQuery costs when a dev accidentally repeated a big query every 5 seconds in an automated script; even though we had budget warnings, it still cost us a fair bit of money. Even after this, I found it tricky to set/catch budgets for single services. I think I had to use Stackdriver to be able to get any kind of warning.

It got into "blinking lights and sirens" territory fast!


Hey! This might not be the best place to contact you, but my team is currently in crisis from a nasty Android Firestore bug. When the user's device switches networks it loses the real-time event listeners and in some cases doesn't re-establish until the app is reinstalled. We submitted a bug report and source for an app with the issue but haven't heard anything in a week. We love the product otherwise, but I thought I'd bring it to your attention.


Hey there. Thanks for letting us know. We unfortunately don't have a fix for this at the moment. We've seen reports on network connections not being reestablished, but have been unable to reproduce it so far. If you've filed a bug report, the best way to help us reproduce it is by adding details to the report. That then also gives our team a way to get back in touch with you once they're able to reproduce (and then fix) the problem. Thanks again for the report!


How about having default conservative alerts built in for new accounts (not caps, just alerts). That way people who forget to set them will be reminded the first time they get one.

An account that goes from $0 spend to $30k in 72 hours should really trigger some kind of flag - even internally within google. What if they didn't have any kind of grant and weren't able to pay?


Are the caps on by default? If not, they should be.


> This is why infinitely scaling pay-as-you-go cloud services terrify me.

It's been a little over a year since I used Firebase in production, so maybe this has changed, but the funny thing is Firebase DB doesn't infinitely scale, despite them advertising that it does.

The Firebase DB caps out at 100k active connections and according to them (at the time) it's a technological limit on their part, so they cannot go higher even if they wanted to.

When we brought this up, they told us they were technically unlimited because you could shard your data into different DBs if you needed more connections, which is like saying all restaurants are all you can eat because you can keep buying more food.


Note this is about Cloud Firestore, not Firebase RTDB. It's very different infra. We announced our beta limits increased to 1M concurrent connections at the Next 18 conference, and we'll continue to improve from there.


The posted link is talking about Firestore, not the Realtime Database. Firestore is in beta and has a beta limit of 100k simultaneous connections; once out of beta it will probably be able to handle more. The Realtime Database has a 100k limit and is a centralized DB.


How do pay-as-you-go cloud providers handle DoS attacks (the "smart" ones that simulate a lot of expensive legit-looking traffic, not simple volumetric ones)?

That's the thing that terrifies me. If I'm using S3/Cloud Storage etc., I'm getting charged for each GB of outbound traffic, and I have to assume that the bandwidth available to serve my files is almost infinite.


You should rate-limit and cache the potentially expensive paths of your app. There's no real way to protect against a clever DDoS, someone will have to eat the costs.


You can’t really rate limit server-side with Firebase. If a client has read-permission on a data node, an attacker can just read from there as many times as they want.

There are some ways to limit writing using the rules engine but you’ll still get charged for failed writes. :)

The only real way to rate limit Firebase that I know of is to put some sort of proxy service in between, but then you lose a lot of the advantage of using Firebase in the first place.


I will also stay far away from these services. As a child post points out, if you do set a cap, you run the risk of cutting off a legitimate spike in high quality traffic. How do you tell the system to shut off due to an error and stay on during a spike in normal traffic? You can’t, unless you have someone watching it at all hours.

That’s why I prefer to just rent whole machines on AWS. If I accidentally have some code stuck in an infinite loop making some O(1) call, don’t charge me $10,000 for that when it costs you nothing. If it’s actually consuming electricity and significant resources, I’ll know quickly because my service will go down, as it should, not scale infinitely until my company is bankrupt.


Even with a cap, a rogue line or a legitimate surge in traffic could shut down your app.

It's endless bill monitoring and budget approval.

I'll stick to a flat rate DO droplet.


What’s the difference between a cap limiting your traffic and server cpu maxed out, or db connection pool or ... limiting your traffic?


One bankrupts you, the other just has a minor hiccup with a few lost customers that day.


A (cost) cap is a feature designed to prevent you from going bankrupt, at the expense of not recovering from the surge: once you deplete the cap, you stay down until the end of the cap's granularity period.

So, if the cost cap is defined as $n per day, once you deplete it you'll be down for the rest of the day (or until you take some manual action to increase the cap, if the cloud provider supports that).

This problem is a function of the granularity. Imagine a system that let you say:

"I want to spend max $n per second with a extra burst of $m per day/week"

You adjust your "$n" to match the throughput you'd have with a fixed-size "pay for what you provision" system, and reserve $m for lucky events such as landing on HN.

The amount of planning you have to do is similar to the traditional resource allocation, but with the benefit of paying less than provisioned if you're not getting all that traffic.
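
Concretely, a provider-side version of that could look like a token bucket denominated in dollars instead of requests. A hypothetical sketch (no cloud provider exposes exactly this knob, as far as I know):

    // Hypothetical spend limiter: refill $n per second, with a burst reserve of $m.
    class SpendBucket {
      private balance: number;
      private lastRefill = Date.now();

      constructor(
        private readonly refillPerSecond: number, // the "$n per second" baseline
        private readonly burstCapacity: number    // the "$m" reserve for lucky events
      ) {
        this.balance = burstCapacity;
      }

      /** Returns true if the request's estimated cost fits the budget; otherwise shed load. */
      tryCharge(estimatedCost: number): boolean {
        const now = Date.now();
        this.balance = Math.min(
          this.burstCapacity,
          this.balance + ((now - this.lastRefill) / 1000) * this.refillPerSecond
        );
        this.lastRefill = now;
        if (estimatedCost > this.balance) return false;
        this.balance -= estimatedCost;
        return true;
      }
    }

    // e.g. a baseline of $0.0005/s (~$43/day) plus a $20 burst reserve for an HN spike:
    const budget = new SpendBucket(0.0005, 20);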


Is a similar approach possible in the cloud? Something like "if traffic goes up, scale up until $5/month and then don't scale further; let it be slow (or, even better, throw errors)". That would be the best of both worlds.


For services that are pay per request, this is effectively a hard cap.

For more capacity based services, sure, but it's more likely to be down than just slow. When systems are run at their limit they rarely operate the way they did with a little less traffic.


Well, a rogue line could just as easily shut down your app hosted on DO.


That's the thing though: If you're a big company, your site being down is much scarier than a 100x bill.

When you're an individual, the potential for a $10k bill is much scarier than your hobby project going down.

When you're a small org/startup, the potential for a $75k bill is probably still scarier than your site going down.


Well, then the issue with AWS Lambda is not its ability to scale, but its default of not limiting your spending.


From what I understand, AWS at least is pretty good about giving you at least one "get out of jail free" card if things go awry.

Caps are tough though. I can certainly understand a use case that would want a hard circuit breaker that just kills everything it can once it hits a certain threshold. Sort of; you presumably don't want everything on S3, for example, to be deleted.

On the other hand, moving up the scale of serious businesses, I can imagine it would be hard to specify circuit breakers (rather than just alerts) and you get into issues of terminating services that affect all sorts of other services across the entire account.


Caps are definitely hard; getting them right as a feature, and setting them as a customer, is still an art. This holds true for 'flat rate' too, where you're trading correct sizing against downtime when you either get the code wrong or get unexpected success. I've been involved in all these situations, so I definitely empathize.

We had daily hard caps and budget alerts, but it's still an area where we can do better.

(Disclaimer: Product Manager for Cloud Firestore)


There's nothing guaranteed about a "get out of jail free" card, so factoring that into your decision making is probably a bad road to go down.


I'm not suggesting that. For a lot of casual users I'm not sure that cloud services are necessarily the best approach for an externally-facing service. They may well be better with a VPS that has a hard resources/dollar limit.

Perhaps cloud providers should have some sort of hard circuit breaker option (though it won't help for some things like storage) but it's probably not a priority as not a lot of businesses--their primary customers--would be OK with effectively hitting the power button for their entire cloud account if they exceed some dollar amount that someone or other configured a couple of years ago.


You only have to be 'terrified' if you don't write good code. Maybe people should actually test their code before pushing it to production?

It's not Firebase's fault you ran bad code and have a huge bill. They have bills to pay also. Think they can tell their vendors "sorry, we can't pay you this week. A customer ran up a 30k bill and they can't pay it, so we can't pay you right now. But lol, bad code, right?"



Well, since Firebase has lowered the barrier of entry with their easy setups, I think it is justified to expect at least some developers to make mistakes.

I don't blame Firebase at all though -- great product.


I spend a fair amount of time on HN.

Among many, I think this article is probably the most succinct indictment of ADHD-ridden "modern" web programming/ecosystem practices I've read.

It's so sad to me that while the name dropping and churn for frameworks and languages continues, frenzied and unabated -- basic (pun sort of intended) analysis and problem-solving techniques go out the proverbial window.

Why learn to think critically when you can just 'npm update', fix 37 broken dependencies, and write a blog post about it? Right?


Critical thinking is a learned skill. For most developers it comes with experience, and many developers in startups are often yet to learn it. Thinking well under pressure is even harder.

This is more a problem about startups using inexperienced developers than anything related to what they're building or which tech they're using.


It comes with experience, but also with more senior devs explaining it to you.

Seeing someone use Firebase to save payments, then recompute a total from a collection, and as a consequence having their system explode with less than one session per second, means everybody in the team drank the « let’s use this shiny Google NoSQL tech, it’s so cool » Kool-Aid.

Even one conversation with any senior dev with some kind of backend development experience would have raised questions about expected load, types of queries, data model, etc., and concluded that storing payments was probably the least interesting scenario for using a tech like Firebase.


Indeed, if you're building anything, good metrics are one of the most important things. I didn't see them having any metrics apart from Google Analytics.


And most companies don't know how to measure experience, so they pass over experienced devs and can't understand their value...


This is the result of top-down learning. You learn the very latest tech and work your way down to the metal as required.

Bottom-up learning starts at the metal, at the very fundamentals of computation, and builds upward.


Definitely looks like several "teachable moments" here. They learned the hard way about:

1. Developing a fix without understanding root cause (try-something development)

2. Sufficient testing, including load testing, prior to initial deployment

3. Better change control after initial deployment

4. Sufficient testing for changes after initial deployment

5. Rollback ability (Why wasn't that an option?)

6. Crisis management (What was the plan if they didn't miraculously find the bad line of code? When would they pull the plug on the site? Was there a contingency plan?)

7. Perfect being the enemy of good enough

Looks like they were bailed out of the cost but what if that didn't happen?


2-7 are sort of understandable for a quick, hacky startup just trying to ship something fast, with minimal experience. But 1 is the really crazy one. Load times spiking to 30 seconds once they start getting significant traffic, and instead of doing a solid investigation, including instrumenting their FE and/or server so they can see where the slowness is (maybe Firebase even includes decent observability by default?), they go all in on upgrading Angular for seemingly no reason? That’s just ... completely illogical.


That raised an eyebrow for me, too. I recently moved a project from Angular 1 to 6, so this is fresh in my mind. There aren't enough changes between 4 and 6 (or even 1 and 6) to cause mysterious 30-second lag times by themselves. And upgrading the front end's base framework under a time crunch is almost always a bad decision, regardless of which framework you're using.

In the companies I've worked for, these guys would be written up and likely put on a performance improvement plan, if not flatly fired.


0. Choose a platform that doesn't bankrupt you if you're less than perfect at 1 to 7.


Try-something is very useful as a troubleshooting tool, when you need to change the state of the issue enough to collect further troubleshooting information.


Yes, it's a good investigation tool, but definitely not a fix. That is, if "it started to work now, I don't know why", you're still in trouble, and should continue digging.


updating the framework version is like.. literally the last try-something thing to do though.


I think "try-anything" would be a better characterization in this case. They had smoke pouring out of the engine and tried changing the tire.


No disagreement there.

I don't understand how you can build a complex application like that without doing basic performance checks like: are we hitting the file system or database too often, are the image assets correctly sized, etc.

I'm not a software engineer however.


0. shut the app down before it costs even more money.

A money clock on the table isn't fun, and if you replace the app with a landing page and a newsletter form, it's completely acceptable for the visitors.


Anyone else feel like they'd rather have their dedicated server slow down instead of racking up a $30k debt?

This is this nightmare I envisioned with cloud services, a client gets hit really hard, and I have to pass the bill on to them.

This reminds me of variable-rate mortgages.

With dedicated hardware, you may end up with performance issues, but never a ghastly business-ending bill. How does anyone justify this risk? I really don't understand the cloud at all for such high cost resources with literally unlimited/unpredictable pricing.

Can someone explain this risk/reward scenario here?


> How does anyone justify this risk?

I'm more concerned about the risk mitigation strategies (capping) I'm seeing advocated.

If your server's being pegged, you've only got a few customers missing out while it's pegged, maybe even everyone getting service, just sub-optimally. You can ride out the wave and everything goes back to normal.

Putting caps in place is like pulling the plug out of the server after the CPU has been at 100% for 5 minutes and not plugging it in until the next billing cycle.


You use dedicated hardware with manual or semi-automated scaling until you're big enough to want unlimited and green enough to eat a large bill knowing that you still made a profit.


I don’t know what kind of load you expect but web servers don’t just slow down linearly forever: at some point when traffic exceeds capacity, queues fill up, memory maxes out, and all requests grind to a halt with enough load. The way out is scaling horizontally, and you want to do that at some point before you end up sending 503s to the load balancer.


I find it ridiculous that their first solution was to go and upgrade to another Angular version, especially a non-beta version upgrade of a framework that is used in thousands of super-high-traffic websites with no problems. How clueless can you be?


> How clueless can you be?

I mean, if you’re junior and you’ve just learned JavaScript, it’s not difficult. I’ve met a lot of monied people who seem to think a junior dev with a few weeks of JavaScript training is equivalent to a senior engineer with a degree. It never works out, at least not for the smart people.


Isn't this why you do pull requests, code reviews and QA before deploying? You need to make sure that the more senior(s) on the team and proper testing catch these things before deploying.


Why are you so against junior devs? I've personally never met a junior dev who just starts going ahead and upgrading various dependencies, as you suggest.

If anything, that shows a lack of proper hiring decision on you and your team's part.

I do, however, agree that their practices are horrible (just look at their console, they're console.logging random things, running the dev mode of Firebase, and fetching some USD conversion call 10x on load with no caching) and they're lucky Google bailed them out at the last minute.


> If anything, that shows a lack of proper hiring decision on you and your team's part.

Hey, friend! I had no control over hiring for that gig.


I can't say enough good things about Firebase and GCP in general, but I'm always cautious when using Firestore in particular. I usually avoid unbounded queries altogether, and treat it primarily as a key/value store to get by id.

When I do use queries, it's always in places where the results have a well-defined limit (usually limit = 1), e.g. finding the most recent X or the highest X.

With the above two, you get all the greatness of Firestore, but with a well-defined (low) cost that you can calculate ahead of time.
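
Concretely, something like this (a sketch using the Firestore web SDK roughly as it looked at the time; the collection and field names are invented for illustration):

    import firebase from 'firebase/app';
    import 'firebase/firestore';

    // Assumes firebase.initializeApp(config) has already been called elsewhere.
    const db = firebase.firestore();

    async function loadCampaignSummary(campaignId: string) {
      // Key/value style: a single document read, cost known up front.
      const campaign = await db.collection('campaigns').doc(campaignId).get();

      // Bounded query: at most one document read, regardless of collection size.
      const latestPayment = await db
        .collection('campaigns').doc(campaignId)
        .collection('payments')
        .orderBy('createdAt', 'desc')
        .limit(1)
        .get();

      return {
        campaign: campaign.data(),
        latestPayment: latestPayment.empty ? null : latestPayment.docs[0].data(),
      };
    }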


We've also been rolling out updates to rules to enable you to enforce these types of things. There are performance implications to limit queries for the real-time update system at scale, but for most use cases this shouldn't be a problem.

Definitely more we can improve here for control, and we're open to feedback.


Hey, thanks for listening.

I believe it would be nice to soon have a way to query on the create, update, and write times of a document. Right now I manage the create time inside my document with Date.now(), but when I was running a bunch of promises to create documents, the createTime was in some cases the same across documents, so my pagination failed.

Other things, like compound queries inside subcollections, would be nice. Also a way to export the whole database for backup.

A flag to tell Firestore to return the document in the same response when I update it (one round trip; DynamoDB has this). I know I can achieve this with a transaction, but I believe this would be simpler than a transaction.

A way to update an array without a transaction.

Thanks


My open feedback is to get these rules in place as soon as possible. This story/blog makes me realize that GCP/Firebase does not currently play nice with dumb mistakes, and that 24 hours away from the dashboard can be disastrous.

Multiple rules/filters need to exist that trigger SMS/email alerts, or a pre-defined action, upon certain conditions.


The saddest thing here is that they seem proud of this "mistake" and how they solved it.

This startup mindset is not always good.


I feel depressed about this, about how the industry promotes and even supports extreme technical incompetence. Maybe it's a consequence of "everyone should learn to code" campaigns and bootcamps.


Cheer up buddy, you were this incompetent at one time too, but now look at you, a full grown narcissistic asshole senior developer. You have come a long way and these kids will too!


I increasingly feel that modern pay-as-you-go services are so opaque to consumers that it takes an individual employee's empathy (highly subjective) or publicity like this (highly subjective, and you have to be lucky as well) to fix any significant problem. Every time a post like this with a "happy" ending crawls onto the HN front page, there are probably hundreds or maybe thousands of "unhappy" endings out there.


Simple stress tests should've revealed this. Basic profiling should've revealed this. This article makes it appear that they went live without ever really testing the infrastructure under any kind of load whatsoever.

The article refers to some mysterious "engineering team". It would appear very little actual engineering took place at that company.


Some have mentioned it here already, but I'd like to emphasize how important application logs are, and how much trouble you can prevent by reading and understanding them.

I've seen and fixed bugs like the ones described in the article, and before you start upgrading anything, a look in the log followed by a git bisect session is the first step.

My Rails apps have great logs: I get to see which views and partials are rendered, which queries are sent to the database and, more importantly, how often all of that happens. If the log excerpt for a single request doesn't fit on my screen, I know I have to do something.


"It is very important that tech teams debug every request to servers before release..." No.

You should know your application's profile, you wrote it.

How many resources does you app need? That's something our developers believe is the "operations team"'s responsability. Well, now that you took the 'devops' role you can no longer keep ignoring this. Your new infrastructure provider will be more than happy to keep adding resources, one can only hope the pockets are deep enough.

With attention to the profile this would have been caught at developing time, maybe testing time.


>with every visitor to our site, we needed to call every document of payments in order to see the number of supports of a Vaki, or the total collected. On every page of our app!

Oh that will do it.


"Besides they understood errors like ours can happen when a startup is growing and some expensive mistakes can jeopardize the future great companies."

Doesn't sound like a future great company to me, especially when their lesson from this was Google will bail them out and "It is very important that tech teams debug every request to servers before release." rather than hiring less cavalier employees and putting in better process.


I've been noticing a steady rise of posts from hackernoon by amateur developers who think they'll be the next great tech blogger. I'm not saying I could do any better, but why are these posts suddenly getting so much attention?


> but why are these posts suddenly getting so much attention?

The same reason we slow down for car crashes: morbid curiosity. I don't think there is anything "sudden" about it though; we even have sites like thedailywtf dedicated to this level of idiocy.


thedailywtf: making fun of other developers' incompetence

hackernoon: author patting themselves on the back while everybody else is laughing at their incompetence


As one of those amateur developers who publishes to Hackernoon, it's because they change titles to fit their "How to" type articles + grab attention.


Disclosure: I work on Google Cloud (but not Firestore or Firebase).

For those that didn't read the article, it had a happy ending:

> GOOGLE UNDERSTOOD AND POWER US UP!

> After we fixed this code mistake, and stopped the billing, we reached out to Google to let them know the case and to see if we could apply for the next grant they have for startups. We told them that we spent the full 25k grant we had just a few days ago and see the chance to apply for the 100k grant on Google Cloud Services. We contacted the team of Google Developers Latam, to tell them what had just happened. They allowed us to apply for the next grant, which google approved, and after some meetings with them, they let us pay our bill with the grant.

> Now we could not be more grateful to Google, not only for having an awesome “Backend As A Service” like Firebase, but also for letting us have 2 million sessions, 60 supports per minute and billions of requests without letting our site go down. Besides they understood errors like ours can happen when a startup is growing and some expensive mistakes can jeopardize the future great companies.


Perhaps somewhat optimistically, I assume most of the commenters read the whole article, but I don't think the happy ending abates the concern. It's great the Google Cloud team was able to bail them out afterwards, but the fact that they were able to rack up a $35,000 cost on a code mistake still highlights one of the major flaws with pay-as-you-go cloud computing.

There's no guarantee that if I made the same snafu next week Google would necessarily be willing to help, but I can absolutely guarantee you that a VM sitting on a Dell PowerEdge I've got lying around would never suddenly obligate me to a $35,000 bill, no matter how bad my code.

Ideally, rather than a hard cap, I guess I'd want some sort of smart alert that goes "holy crud, this is an unusual spike in the rate of requests" when the delta changes unusually, rather than waiting until I, say, hit a high static cost bar or a hard cap that kills the site.


Right, I don’t deny that (and didn’t mean to imply this isn’t a problem). At the time of my comment it did seem like most people stopped before the end :).

Budgets and quotas are really tricky as Dan pointed out elsewhere in this thread. App Engine has had default daily budgets (that you can change) forever, but then you run into people saying “What the hell, why did you take down my site?!”.

In this case, they even intentionally pressed forward once they saw their bill was going up. If this had been, say, a static VM running MySQL with a "SELECT *" for every page view, the site would likely have just been effectively down. For some customers, that's the wrong choice, even in the face of a crazy performance bug.

That said, we (all) demonstrably need to do better at defaults as well as education (the monitoring exists!).


That, or, as explained in the SRE book, the extra load should've been shed. The binary choice is not always the right one. Though, granted, in this case, it wouldn't have mattered...


Yeah, I think "the web server ran out of money at 3 PM today" isn't the right choice, whereas "a lot of people had a hard time getting to the site when it got HN'd around 3 PM, but it was back to normal by 5 PM" is a better result.

For smaller organizations, something being down during extreme load is a recoverable problem, but owing the cloud provider all of their money may not be. (Note that even in the case here where Google got them the grant to cover this bill, this is still probably 35K in grant money that could've gotten them further or been used better elsewhere.)


I don’t see why this is being downvoted. The title of the post should be changed to “We were given a $30k credit by GCP after not monitoring our account charges”


A lot of people here talk about the lack of load testing and other "do it the right way" advice, but remember this is a startup. In my opinion, a solid testing foundation would be overkill, and the time is better spent implementing more features.

Also, I bet they did some manual testing. They didn't catch it because the latency only shows up for an account with a lot of followers.

I agree that their first solution, upgrading, was a bad idea... You should understand what caused the bug before trying to fix it.

I highly encourage you to monitor the load/pay/request graphs on a daily basis. Even better if you hang a screen in the office that displays them. The graphs are already provided by Firebase. That way you can catch this type of anomaly on day one. Also, Firebase supports progressively rolling out new features: https://firebase.google.com/use-cases/#new-features


Lack of testing and poor code quality are completely different things though.


The blog authors should state up front that Google covered the costs with an additional startup grant, and that this was entirely due to a quadratically expensive query.


"We contacted the team of Google Developers Latam, to tell them what had just happened. They allowed us to apply for the next grant, which google approved, and after some meetings with them, they let us pay our bill with the grant."

Makes for an interesting counterpoint to the currently popular "Google is evil" narrative. The truth is probably much more mundane: Google is an awful lot of people trying to work together to achieve a bunch of shared goals and doing an imperfect job of it. This isn't just rose-tinted: I'm quite sure they have their fair share of bad actors, and they certainly make decisions we don't all like (e.g., retiring products), but I don't think it's because the company is fundamentally evil.


I agree with your argument, but I don't think that people are calling Google evil because they retired some products. See [0], which was resolved (I think, though I can't find a source for that) due to the fact that Google is comprised of a lot of different people, and [1], which is a fairly recent announcement that has made some people uncomfortable.

[0] https://news.ycombinator.com/item?id=17202179

[1] https://news.ycombinator.com/item?id=17660872


For me, I find what they're doing with Android pretty evil (disallowing manufacturers from also offering phones with alternative distros/OSes).


Except that their core business is to build profiles on every one of their users to sell ads...

We could have a debate on whether Philip Morris is evil. I am sure most of their employees are pretty decent people.


When was the last time the consumption of a Google ad led to cancer and the eventual death of anyone?


> When was the last time the consumption of a Google ad led to cancer and the eventual death of anyone?

Great logic. Philip Morris is evil because they sell products that cause cancer and death. Therefore Equifax, who extorts money from people to protect them against Equifax polluting their credit rating, and who leaks their data into the wild, is not evil, because Equifax causes neither cancer nor death!


When was the last time Google had a data breach and leaked my social security number they didn't have and didn't use to "extort" money from me?


But did you look at the slowdown in the requests to Firebase before going off to upgrade frameworks?


Words fail me as to how anybody could put something live without understanding how the application functioned as a whole. Unbelievable.


This is another good reason to properly test before going live. If their site wasn't cloud hosted it would have just fallen over, most likely. Which means a failed crowdfunding effort. Maybe take off the cowboy boots, guys?


"This means that every session to our site read the same number of documents as we have of number of payments. #UnaVacaPorDeLaCalle received more than 16,000 supporters, so: 2 million sessions x 16,000 documents = more than 40 Billion requests to Firestore on less than 48 hours."

TLDR; Horrible architecture decisions like this can be very costly.


I don't think they're aware of the architecture problem; besides, they seem proud of having 40 billion requests.


I wouldn’t call it an architecture decision. It’s the equivalent of a bad SQL query.


A bad SQL query wouldn't do that. Look at their site from the US: it calls Cloud Functions for a COP-to-USD conversion in every place it renders a currency (200+ requests just loading their homepage). I think it was built poorly, but that's just my opinion.


I agree with you. There are so many ways to make this infinitely more efficient. For starters, why are they re-calculating and re-rendering the value every time they get a new donation? Also, they could keep those values in colder, cached storage and only read from Firestore to refresh them. Don't use a database that charges you for every single read and write to handle mundane client-side rendering.


Changing the currency will re-trigger all those conversion calls ...


It's more like a naive database design which doesn't consider the access pattern and therefore requires an inefficient query, which is definitely an architectural issue.


It's only ~460k QPS, and with 16k documents everything would be cached really well. Even a single Redis instance could serve that read load fairly easily.


Good luck with that. A single Redis instance would not be able to serve that workload. Maybe a single machine, but you're really pushing the limits there just with the concurrent TCP connections.

At that QPS, Redis has about 2 microseconds per request.

I agree it caches well, but your proposed architecture is definitely not production quality.


There is a difference between queries and connections. With pipelining, a Redis machine can easily serve that. If they stored things in a better structure, like a list, they'd easily be able to get that throughput.

On my machine, with 5 concurrent requests of 100 items each, it can do ~6,500,000 items per second; at 300 items, ~11,000,000 per second, and it roughly caps out there. Even with 1 concurrent connection, at 600 items per request, you get ~6M per second.
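For reference, a pipelined read with ioredis looks roughly like this (just a sketch; the key names are made up):

  import Redis from 'ioredis';

  const redis = new Redis();

  // Batch many GETs into a single round trip instead of one round trip per key.
  async function readAmounts(keys: string[]): Promise<(string | null)[]> {
    const pipeline = redis.pipeline();
    for (const key of keys) pipeline.get(key);
    const results = await pipeline.exec(); // one [err, value] pair per command
    return (results ?? []).map(([, value]) => value as string | null);
  }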


Legit question here: what would be a good architecture for this case?


SQL with no ORM. Harder to be unaware of what you’re querying when you have to write the queries.
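A rough sketch of what that looks like with node-postgres (the table and column names are made up):

  import { Pool } from 'pg';

  const pool = new Pool(); // connection settings come from the usual PG* env vars

  // The aggregation is explicit: one query, one row back, no hidden per-document fetches.
  async function campaignTotal(campaignId: number): Promise<number> {
    const { rows } = await pool.query(
      'SELECT COALESCE(SUM(amount), 0) AS total FROM payments WHERE campaign_id = $1',
      [campaignId],
    );
    return Number(rows[0].total);
  }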

Not using attempted magic like Firebase would also fix the problem where the home page transfers 9 MB of data from Firebase on top of the 1 MB JavaScript, which appears to be… their entire database or something?? Accessible to the frontend??? Censored excerpt from that response:

  "email":{"stringValue":"soXXXXXXXsu@gmail.com"},"fechaDisponible":{"integerValue":"1527483600000"},"identification":{"nullValue":null},"key":{"stringValue":"1527560309061"},"listaNotificaciones":{"arrayValue":{"values":[{"stringValue":"alXXXXXXXXi@yahoo.com"},{"stringValue":"soXXXXXXXsu@gmail.com"},{"stringValue":"pXXXXX2@hotmail.com"},{"stringValue":"BeXXXXXXXXXXXXXXez@hotmail.com"},{"stringValue":"pXXXXXX0@gmail.com"},{"stringValue":"trXXXXXXXXXXro@gmail.com"},{"stringValue":"joXXXXXXXXXXXXie@hotmail.com"},{"stringValue":"ivXXXXXXXXxd@gmail.
Also appears to expose, for each campaign, the poster’s bank name and date of birth.

And wastes a bunch of various resources making separate requests to a currency conversion service for each amount, as others have noted. And requests /null and /undefined. This might be the most irresponsible development I’ve ever seen.


Exposing the contents of the DB without even needing any SQL injection... I think they have no idea about network monitoring. They _have_ an idea about the console, though, since they're logging stuff there extensively.


> we were using Angular V.4 and we decided to upgrade everything to V.6. It was a huge risk and we wanted this campaign to be perfect, so we did it! In just a day and a half, our team had the first release ready in the new version. After some tests, it looked like the refactor helped the app’s speed, but it was not as fast as we wanted it. Our goal was to load in 3 seconds and it wasn’t working as we expected.

You want it perfect and you cannot afford to take down the site, yet you're willing to take a "huge risk" based on (wrong) guesses, with clearly not enough time for proper QA. I sincerely suggest you slow down and reflect on priorities and risk assessment; there's a reason that Firebase code slipped through. By the way, I'm happy you avoided the worst-case scenario. Good luck with your project.


> I sincerely suggest you to slow down

That runs counter to the great Silicon Valley ethos of moving fast and breaking things


The "move fast and break things" guy came from the East Coast.


The thing I'm wondering about is the seniority of the dev team and what their commit and code review process is. It seems like with a senior dev reviewing commits, someone would have caught that redundant work was being done.

I worked at a startup in the Mission for a few months and I remember seeing a quadratic query that ran for every customer logged into our application. The CEO and team lead wondered why our app worked great in dev (with only 5 users) but was terrible on premise (250 users). When I tried to explain the issue, the two devs before me didn't really understand what I was talking about. A quick refactor and a caching solution fixed the problem, but it was clear the team was still pretty green.


The issue with GCP and AWS providing limitless scalability is that bad code gets a free ride, and we all occasionally write bad code. If the OP had tested this against a database on a 1-CPU server with 2 GB of RAM, they would have caught it quickly in early dev testing.


The company's success is good, but the SELECT ALL query is an even bigger shame...


It shouldn't be _possible_ to run select all in production.


So it's a classic N+1 query?


It's more like (in the good old RDBMS world) running 'SELECT * FROM payments' and calculating the sum on the client vs 'SELECT SUM(amount) FROM payments WHERE some_id = ?',

though they already had this sum precalculated.


That would be more forgivable. This is:

  SELECT * FROM payments where paymentID = 1
  SELECT * FROM payments where paymentID = 2
  SELECT * FROM payments where paymentID = 3
  ...
  SELECT * FROM payments where paymentID = 14986
Each of those in its own API request over the wire, then summing them on the client.


Not really. N+1 is about « to-many » relationships. You would need at least 2 collections to run into that problem.

In SQL terms, what happened would be more like doing a full table scan on each query instead of using an index (and not even that, because pre-computing a total isn't really like an index).

This kind of « account balance » problem is typical of the cases where transactions are really useful, but also, historically, the kind of problem where NoSQL tech does a poor job (it's built more with « eventual consistency » in mind than atomic or transactional behavior).


| How we spent $30k in Firebase in less than 72 hours

By not checking Google Billing after you launch your website. At the very least you should have a billing alert.


At the very least, Google (and really all cloud services) could have reasonable default billing alerts built in, plus automatic spike detection.

I'd rather have default alerts already configured that I can change than none at all.


And that's why people write applications to understand and track your public cloud usage, like https://github.com/trackit/trackit ... Most people simply can't follow what's happening in their cloud apps.


So what is their product? I can't read Spanish, and I don't use Chrome for translation.


A GoFundMe-style platform that allows groups of people to put funds towards a common goal (like a party or whatever)... in this case it was used to donate to a politician who had a huge debt after losing an election.


Wow, a complete shitshow that could've been avoided by consulting one experienced developer who is good at debugging. This kind of negligence must happen more often than we think, so there are plenty of opportunities out there.


Gosh. One for https://accidentallyquadratic.tumblr.com/ with the added spice of being on usage billing...


This reads less like a "mistake", and more like just not giving any thought to what your code is doing as you write it. A mistake would be if the programmer meant to implement X but instead implemented Y; this sounds more like the programmer just set out to implement Y without even considering other possibilities.

Same for the framework change. An outdated framework might be a second or so slower, but a >30-second load time was never going to be fixed by updating. This is just bad problem-solving skills. When your app is taking 30 seconds to load, you don't just guess at what might make it faster -- you open your JS console and your log files, and you figure out where that time is being spent. Two minutes in the Performance tab of Chrome's developer tools, and you would have figured out the issue was on the back-end rather than the front-end.


They calculated a number the wrong way. That's not a hard mistake to make. And changing frameworks is silly but not really part of the problem.


If there's one thing Google App Engine has taught me, over the course of 5+ years using it:

Denormalize your data (or, optimize for read).


Just a clarification in case a junior dev reads this comment: denormalizing has nothing to do with the problem the OP experienced. They simply queried the whole collection of payments instead of precomputing a total every time a new payment arrived.

Denormalizing would have required a relation between collections. Here there was just one collection (so there was nothing to denormalize).

Optimizing for reads (or simply thinking about read performance) doesn't require denormalizing. It can also be a matter of creating an index, or precomputing values in a cache, as in the OP's case.
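The write side of that precomputed-total pattern is small; here is a sketch with the modular Firestore JS SDK (the document path and field names are hypothetical):

  import { doc, runTransaction } from 'firebase/firestore';
  import type { Firestore } from 'firebase/firestore';

  // On every new payment, bump a single stats document in a transaction,
  // so readers never have to touch the payments collection itself.
  async function recordPayment(db: Firestore, vakiId: string, amount: number) {
    const statsRef = doc(db, 'vakis', vakiId, 'aggregates', 'stats');
    await runTransaction(db, async (tx) => {
      const snap = await tx.get(statsRef);
      const prev = snap.exists() ? snap.data() : { total: 0, supporters: 0 };
      tx.set(statsRef, { total: prev.total + amount, supporters: prev.supporters + 1 });
    });
  }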


My only question: where were the metrics around Firebase request counts? That could have been a simple dashboard.


It seems like they:

* didn't have good testing

* did no load testing before

* had no code reviews done

* no design reviews done

* had no or very few (useful) application logs

* had no change control mechanisms defined or followed (upgrading a framework in production in a matter of minutes or hours as a way to wing it and pray it works out?)

* had few or no automated tests

* didn't do a detailed post mortem or root cause analysis to see what they could do to prevent it (the ending looked quite amateurish, pointing to only one thing as the lesson)

* wasted a lot of money that could've helped them in the future (by using it instead to pay for an error)

If I were on such a team, I would've honestly stated in an article like this how deeply ashamed I am that we missed all these things, how "cowboy coding" and heroics must never be glorified, how very, very lucky we got in having someone else waive the charges (this is not a luxury most startups or solo endeavors would have), and ended by asking for advice on what could be done to improve things (since it's obvious there were many more gaps than just how a few lines of code were written).

To the team that wrote this code and this article: adopt some software development methodology (any, really) and get some people who can help you follow it. Also read the rest of the comments here. You got very, very lucky in this instance. It may not go that way again, and you may see your "life's work" get killed because you didn't really learn.


If they had half of the above, they wouldn't be using Firebase. Don't get me wrong, I think it is a great product and I would use it for an MVP, but the point of using Firebase is to get something scalable easily and quickly by focusing on the frontend only. You can tell from their screenshots that they were only watching GA metrics and didn't check the Firebase console.


They just didn't profile before optimizing. That's all.


[flagged]


Please don't get personally nasty, even if another comment was smug.


Go serverless boys :)


It won't resolve the problem


AWS Lambda costs about 20 cents per million requests, so a billion requests would have come to somewhere in the range of $200-300. Factor in a few additional read units for DynamoDB, and I still don't see it totalling $30,000. The serverless architecture should scale as well.


They are using a serverless architecture; Firestore is a "DBaaS". The page-load latency wasn't the real problem, it was a side effect. The real problem was the 40 billion requests: they were fetching the whole payments collection in each session. That comes to 16K documents downloaded per session, so they were paying for 16K document reads per session. Firestore gives you 50K reads/day for free and charges $0.06 per 100K reads, so this was really bad logic on their part. If they had done it the better way, updating a single document that holds two accumulators (total money collected and total number of payments) and reading just that document on each session, their bill for showing this information across 2M sessions would have been less than $2, because you read 1 document per session instead of 16K. And note that their system wasn't down; the 30s latency was probably a side effect of downloading 16K payment documents to the client.
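The read side of that fix is then a single document get per page load instead of a query over the whole payments collection; a sketch with the modular SDK and hypothetical names:

  import { doc, getDoc } from 'firebase/firestore';
  import type { Firestore } from 'firebase/firestore';

  // One billed read per session: fetch the precomputed totals document.
  async function loadCampaignStats(db: Firestore, vakiId: string) {
    const snap = await getDoc(doc(db, 'vakis', vakiId, 'aggregates', 'stats'));
    return snap.exists() ? snap.data() : { total: 0, supporters: 0 };
  }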


But wait, the image shows ~$600 million. So $30K is small.

Maybe that ~$600 million isn't in USD?


The first sentence in The Fine Article mentions Colombia, whose Peso uses "$" as its symbol, and currently has an exchange rate of ~2900:1 USD.


Thanks. And I did miss the "COP" in front of the $ amount :(


No, it doesn’t. In many other locales “mil” means thousand, unlike the English slang “mil” for million. This is why in finance $5mm means five million dollars: five mil mil, five thousand thousand, five million.

The graph shows a spike to around $5,000 per day ($5 mil por día). The entire dashboard is in USD, presented in a Spanish locale. That is also why the dollar sign is suffixed, the months are not capitalized, and why May has a dot after it, because it is abbreviated there (mayo).

Every programmer should understand locales even if they do not speak the language.


No, it shows "$600,043,603". Where do you see "mil"?

But OK, I didn't get that "COP" is Colombian Peso :(

And there's another image that shows total collected as "USD $244,875". The ratio is 2450, which is close enough to the exchange rate.


The Y axis of the graph, which actually has the relevant information. You are looking at a campaign page and assuming those Colombian pesos are available to the author’s team.

When you said “the image” I thought you were looking at the right one, and I found it odd that you were off by several orders of magnitude from what I assumed was your misunderstanding. That explains it. I had to go back and find your figure.


Ah, I didn't look closely at the charges graph. So yes, it maxed out at ~5,000 USD per day.

I gotta say that using "$" for both USD and COP is confusing. So you must say "USD $x" and "COP $x". Then why bother with the "$"?


Apparently the $ sign has its origins in the Spanish peso, where the P and S were gradually merged together in abbreviations.

Furthermore, the US dollar itself stems from the Spanish dollar:

"The U.S. dollar was directly based on the Spanish Milled Dollar when, in the Coinage Act of 1792, the first Mint Act, its value was fixed [..] as being "of the value of a Spanish milled dollar as the same is now current"


True enough.

But it's still potentially confusing, when stuff gets translated with no context for currency values. Especially, I imagine, if you don't know either Spanish or English.


Correct. They had it first, which is why I push back lightly on the “they use $ too? That’s confusing.” sentiment.

The Dutch, ever innovative in trade, can lay claim to a bigger influence on the word “dollar” and the currency form itself, however, and colonial Americans traded regularly in Dutch daalders (we still pronounce it that way, unlike doh-LAHR/doh-LAHR-ehss for the Spanish varieties). Daalders themselves were descendants of Bohemian thalers, as were Spanish dollars. We just borrowed the neighboring dollars when the time came, probably due to our foreign policy environment at the time, trade with Florida, and so on.


There are over 20 different types of "dollars".


That's really my point. Not Americentrism or whatever.

I mean we have kilograms, meters and seconds. And they're the same for every country.

But "$" (dollars and other currency units) means different things, depending on the context. Similarly for ounces, pounds, feet, gallons, etc. So you're left with constructions like "US $" or "USD" or "USD $" vs "Can $" or "CAD" or "CAD $". Just as with "avoirdupois ounce" vs "troy ounce", "US gallon" vs "imperial gallon", and so on.

So anyway, I always write "foo USD", "foo EUR", "foo mBTC" and so on. To avoid ambiguity.


Context. Canadian dollars use $ too, and you only see CAD near the border or when it isn’t clear. If I’m a Colombian using a Colombian site and pesos use $, I don’t need the context. Also, properly, you’d say 5 USD, not USD $5; the dollar sigil is then redundant.

There’s a bit of Americentrism down that “confusing” line of thought, for what it’s worth.


> This is why in finance, $5mm means five million dollars

That's not true. It's from Latin, mille.

That aside, I find it a terrible way of writing things. But then again, Americans hate SI, so I guess have fun. :-)



