
You are talking about the popular narrative that "left brain" thinking is more logical and "right brain" thinking is more creative. You are correct that this is unsupported.

The post you are replying to is talking about the small subset of individuals who have had their corpus callosum surgically severed, which makes it much more difficult for the brain to send messages between hemispheres. These patients exhibit "split brain" behavior that has been well studied experimentally and can shed light on human consciousness and rationality.


You should apply anyway and describe your situation! Supporting additional countries is challenging from several angles (payroll, international tax structure, compliance, etc), but we can talk through your particular situation individually.


They can certainly apply as part of your team! Though you should give some thought ahead of time as to what their roles would be at Stripe. We're not currently hiring for the roles of CEO or CTO ;).


Hi, I work in infrastructure at Stripe and I'm happy to provide more insight. Several threads here have commented on our tooling and processes around index changes. I can give a bit more detail about how that works.

We have a library that allows us to describe expected schemas and expected indexes in application code. When application developers add or remove expected indexes in application code, an automated task turns those changes into alerts prompting database operators to run pre-defined tools that handle the index operations.

In this situation, an application developer neither added nor removed an index description, but rather modified an existing one. Our automated tooling mishandled this particular case: instead of interpreting it as a single intention, it encoded the change as two separate operations (a removal and an addition).
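
To make the failure mode concrete, here's a minimal sketch (hypothetical code and index names, not our actual library) of how diffing declared index specs as plain sets turns one modification into a seemingly unrelated removal and addition:

    # Hypothetical sketch: index specs declared in application code,
    # diffed as plain sets. Names are made up for illustration.
    old_specs = {("charges", "merchant_id,created"), ("charges", "status")}
    new_specs = {("charges", "merchant_id,created,status"), ("charges", "status")}

    def diff_indexes(old, new):
        # A set difference has no notion of "modified": an edited spec
        # surfaces once as a removal and once as an addition.
        return {"add": new - old, "remove": old - new}

    print(diff_indexes(old_specs, new_specs))
    # {'add': {('charges', 'merchant_id,created,status')},
    #  'remove': {('charges', 'merchant_id,created')}}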

Developers describe indexes directly in the relevant application/model code to ensure we always have the right indexes available -- and in part to help avoid situations like this. In addition, the tooling for adding and removing indexes in production is restricted to a smaller set of people, both for security and to provide an additional layer of review (also to help prevent situations like this). Unfortunately, because of the bug above, the intent was not accurately communicated: the operator saw two operations, not obviously linked to each other, among several other alerts, and, well, the result followed.

There are some pretty obvious areas for tooling and process improvements here, and we've been investigating them over the last few days. For non-urgent remediations, our practice is to wait at least a week after an incident before conducting a full postmortem and deciding on remediations. This gives us time to cool down after an incident and think clearly about the right long-term fixes. We'll be having these in-depth discussions, and making decisions about the future of our tooling and processes, over the next week.


(Tedious disclaimer: my opinion, not speaking for my employer, etc)

I'm an SRE at Google, where postmortems are habitual. The thing that jumped out at me here is that a production change was instantaneously pushed globally, instead of being canaried on a fraction of the serving capacity so that problems could be detected. That seems like your big problem here.

(Of course, without knowing how your data storage works, it's difficult to tell how hard it is to fix that.)


Yup.

This is one of our few remaining unsharded databases (legacy problems...), so we can't easily canary a fraction of serving capacity. However, one clear remediation we can implement easily is to have our tooling change a replica first, fail over to it as primary, and, if problems are detected, quickly fail back to the healthy former primary.

Lesson learned. We'll be doing a review of all of our database tooling to make sure changes are always canaried or easily reversible.
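
Concretely, the flow we have in mind looks roughly like this (a sketch with an in-memory stand-in for the cluster, not our actual tooling):

    import time

    class Node:
        def __init__(self, name):
            self.name = name
            self.indexes = set()

    class Cluster:
        def __init__(self, primary, replicas):
            self.primary = primary
            self.replicas = replicas

        def failover_to(self, node):
            # Promote `node` to primary; demote the old primary to a replica.
            self.replicas.remove(node)
            self.replicas.append(self.primary)
            old_primary, self.primary = self.primary, node
            return old_primary

    def healthy(cluster):
        return True  # stand-in for real health checks and alerting

    def canary_index_change(cluster, index, soak_seconds=300):
        canary = cluster.replicas[0]
        canary.indexes.add(index)                  # 1. change a replica first
        old_primary = cluster.failover_to(canary)  # 2. fail over to it
        time.sleep(soak_seconds)                   # 3. let real traffic soak
        if not healthy(cluster):
            cluster.failover_to(old_primary)       # 4. quick fail-back path
            raise RuntimeError("rolled back: unhealthy after index change")
        for node in cluster.replicas:              # 5. roll out everywhere else
            node.indexes.add(index)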


hi jorge

I'd actually applied to work at Stripe about two years ago; you guys turned me down ;)

I was responsible for ops at a billion-device-scale mobile analytics company for about 1.5 years. Your tooling is far superior to anything we produced. I like the idea of a single source of truth describing the data model (code, tables, query patterns, etc.) a lot, and doubly so that it's revision-controlled and available right alongside the code.

I think it's far from settled, though, how much to involve human operators in processes like this. Judging from this answer, you seem to be on the extreme end of "automate everything". How, then, do you train developers on (or communicate to them) what can be done safely versus what would cause I/O bottlenecks, slowdowns, or other production-impacting effects? Can you even predict these things accurately in advance? (Some of our worst outages were caused by emergent phenomena that only manifested at production scale, such as hitting packet-throughput and network-bandwidth limits on memcached -- totally unforeseeable in a code-only test environment.)

It sounds like you let developers request changes (a la "The Phoenix Project"), but ops is responsible for final approval of the change? That actually sounds like a great system. I'd love some elaboration on this.

In any case, great writeup. And from one guy who's been there when the pager goes off to another: it sounds like the recovery went pretty smoothly.


This is indeed a tricky balance. We want developers to iterate quickly, but we also want to understand the impact of production changes. With a small team and small sets of data, it's easy for everyone to understand the impact of changes and it's easy for modern hardware to hide inefficiencies. As we grow, the balance changes. It's harder for any one person to understand everything. It's also harder to hide inefficiencies with larger data sets.

We're always learning and improving. In order to scale, we'll need better ways to manage complexity and isolate failure. Our tools, patterns, and processes have changed quite a bit over the last few years, and they will continue to change. Ultimately, we want every Stripe employee to have the right information evident to them when they make decisions. This will be challenging, especially as we grow, but I'm excited to take on that challenge.

If you're still interested in working at Stripe, I'd encourage you to reapply! Our needs have changed quite a bit since you applied, and we're willing to reconsider candidates after a year has passed. Feel free to shoot me a resume: jorge@stripe.com


Shouldn't developers understand how a database change is going to impact an environment based on the code they've written?


Yes, they very much should! But in my (admittedly anecdotal) experience, only the best and most senior developers ever do. Almost every junior or mid-level developer I've worked with (and a small handful of senior folks) not only has no idea how changes like this will impact the larger environment, but won't even think to look into it.


In part, though, that's because the tooling to do it easily absolutely sucks. The impedance mismatch (overused, but apt in this context) between the two parts of the system causes a lot of the underlying issues. Better tooling is a large part of the solution, I think, but I've not seen anything that would help, and the surface area of a modern RDBMS is so large -- without even getting into vendor-specific features -- that I'm not sure what that tooling would even look like.


That's certainly a great point! If there were a way to automatically test much of this, I bet even the newest of engineers could catch problems like this. Doing that is tough, hmm...


I think the only way you could do it on top of an RDBMS is to use a strict subset of features that are common across vendors (something many ORMs already do), which reduces the problem scope to something manageable. The issue then is that there would always be the temptation to use something outside that subset and forgo the easier testing; fast-forward a bit and you have the same issue all over again.

It would be interesting to build an RDBMS that enforced that subset by simply not allowing those features to be used/abused, while still supporting many modern features (JSONB, etc.), but that is way beyond my area of expertise.
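
Even without a new RDBMS, you could approximate the enforcement with a gate in front of migrations -- a toy sketch, with whitelist patterns invented purely for illustration:

    import re

    # Only statements matching the supported subset get through; everything
    # else needs a human. These patterns are made up for the example.
    ALLOWED = (
        r"CREATE INDEX CONCURRENTLY \w+ ON \w+",
        r"ALTER TABLE \w+ ADD COLUMN \w+ \w+",
    )

    def check_migration(sql):
        for stmt in filter(None, (s.strip() for s in sql.split(";"))):
            if not any(re.match(p, stmt, re.IGNORECASE) for p in ALLOWED):
                raise ValueError(f"outside the supported subset: {stmt!r}")

    check_migration("CREATE INDEX CONCURRENTLY idx_email ON users (email)")
    # check_migration("ALTER TABLE users DROP COLUMN email")  # would raise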


You would think so, but far too many developers don't really know how databases behave under load.


Why not just use simple, version-controlled database migrations and test them in a test environment?


Generally you want your database migrations described in a straightforward manner for development: a simple change from old to new (and back). With a live (busy) production database, it is often necessary to handle things differently to maintain uptime.

As a simple example, to make an atomic change to a write-only table, you could create a copy of the table, alter the copy as necessary, and then, in a single rename operation, rename the live table to '_old' and the '_new' table to live. You most likely would not want to add two extra table schemas, and all of those steps, to your development database operations.
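
In code, that pattern might look something like this (assuming MySQL, where a multi-table RENAME is atomic; table and column names are invented):

    def swap_in_altered_copy(cursor):
        # Only safe if nothing is writing to the table during the copy.
        cursor.execute("CREATE TABLE invoices_new LIKE invoices")
        cursor.execute("ALTER TABLE invoices_new ADD COLUMN currency CHAR(3)")
        cursor.execute("INSERT INTO invoices_new SELECT *, 'USD' FROM invoices")
        # One atomic operation: the live table becomes _old, the copy goes live.
        cursor.execute(
            "RENAME TABLE invoices TO invoices_old, invoices_new TO invoices"
        )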

It's entirely possible that they could capture what is done in production as migrations, and test them first, but it would still likely be separate from what the application developers are working with.


Development databases normally hold a small amount of data, so migrations should execute instantly, or nearly so, no matter how complex they are.


True, but I don't think it negates anything I wrote. You don't keep development migrations simple so they'll run quickly; you keep them simple so they're easy to create and understand. Writing migrations (whether automated or manual) for production is a separate task and even a separate skill from designing the database structure itself, so there's no reason why the two need to be (or should be) combined.


Meant to write 'read-only' in the example there. Those steps wouldn't work well for a table that's being written to, since it could change in the process. Anyway, it was just an example.


What kind of database was the incident on?


Have you considered integrating index statistics into these changes? To take an example from MySQL: the INDEX_STATISTICS table in information_schema contains the cumulative number of rows read from each index. Checking it twice, one minute apart, before applying the index drop could have shown that the index was under heavy use and required human intervention.
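
A sketch of that check (assuming Percona Server or MariaDB with userstat enabled, since stock MySQL doesn't ship INDEX_STATISTICS; the connection is any DB-API handle and the names are placeholders):

    import time

    QUERY = """
        SELECT ROWS_READ FROM information_schema.INDEX_STATISTICS
        WHERE TABLE_SCHEMA = %s AND TABLE_NAME = %s AND INDEX_NAME = %s
    """

    def index_reads(conn, schema, table, index):
        cur = conn.cursor()
        cur.execute(QUERY, (schema, table, index))
        row = cur.fetchone()
        return row[0] if row else 0  # no row recorded means no reads

    def safe_to_drop(conn, schema, table, index, interval=60):
        before = index_reads(conn, schema, table, index)
        time.sleep(interval)
        after = index_reads(conn, schema, table, index)
        return after == before  # any delta => in use; flag for a human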


MongoDB doesn't track this information, unfortunately.


It looks like the latest version does: https://jira.mongodb.org/browse/SERVER-2227

The problem with MongoDB is that teams think they can get away with just setting it and forgetting it. Real companies have DBAs who monitor it, understand it, and make a living with it. MongoDB is just trying to automate that with fancy UIs. That's what you get for trying to automate your DBAs.


3.1.x is a development branch and not intended for production use. When they release 3.2, MongoDB will support it.


That was my thought as well, but this change was made by an operator rather than a DBA, and DBAs tend to be a bit more curious about these kinds of changes.


Wages for campus jobs for undergraduates at Stanford start at $13.25/hr, and can go quite high. http://financialaid.stanford.edu/aid/employ/wage_scale.html

And there are plenty of off-campus jobs (SAT tutoring, etc) that can pay $35+/hr.


The food benefit isn't taxed. To get the same $3,750 food benefit via salary, your employer would have to pay you $5,000-$6,000 more in salary.
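
Back-of-the-envelope, assuming a combined (federal + state + payroll) marginal rate of about 30%: $3,750 / (1 - 0.30) ≈ $5,360 in extra salary to net $3,750, which lands squarely in that range.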


I've found there are two very different worlds of software development. For convenience, I'll label them the "Microsoft" world, and the "Open Source" world. (The labels aren't completely accurate, but I think they're largely descriptive.)

The Microsoft world runs .NET, Visual Studio, C#/VB.NET/ASP.NET, targets the Windows desktop runtime (though increasingly also the web), relies primarily on proprietary (and usually non-gratis) libraries and tools, etc.

The Open Source world revolves around *NIX, uses open-source language implementations (gcc, clang, Java via OpenJDK, V8 JavaScript, MRI Ruby, etc.), targets the Linux runtime (and sometimes OS X/iOS), and relies primarily on open-source (and gratis) libraries and tools.

The ecosystem differences go pretty deep. For example, even though either world can interop with practically any SQL database, inhabitants of one will largely choose Microsoft SQL Server while inhabitants of the other will largely choose MySQL/PostgreSQL.

Both can have great software development or terrible software development. It's possible to mix and match (e.g., using Windows doesn't preclude you from writing Ruby).

But startups tend to choose the Open Source world, likely due to the combination of lower licensing costs and the "hackability" of open-source software. I'd argue that due to those same reasons, the Open Source world has produced more innovation in the last decade.


Luckily for him, he's working for himself. I can find that out with a couple of Google queries and a LinkedIn profile.

But I have no idea whether you work for yourself, for someone else, or whether you're a hiring manager.


We have an internal mailing list called "Crazy Ideas". No idea is too crazy.

Greg emailed the list two weeks ago with a proposal for the open-source retreat. The reaction internally was quite positive. He polled open-source maintainers externally to see whether this was something they'd be interested in. The reaction externally was quite positive too.

Greg hammered out the details and shipped it.


This idea has been making me smile for like the past 40 seconds.


And that's how you do it. Well done, Stripe! Glad to know that the money they've received from me through API usage is being well used and having a positive effect :-)


That's awesome, sounds like a fantastic company culture. Thanks for the info.


You guys are so fantastic. This is awesome, and it's great to see companies embracing open source!


Kafka uses Linux's zero-copy sendfile support to move bytes between the network and the disk without going through user space, let alone the JVM.

There's still GC pressure from objects Kafka allocates on the JVM heap, but the actual message data never passes through the JVM.
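
Kafka itself does this via Java's FileChannel.transferTo, but the same syscall is reachable from Python -- a minimal sketch (the socket setup is assumed):

    import os

    def serve_file_zero_copy(path, conn):
        # `conn` is a connected socket. sendfile(2) has the kernel copy file
        # bytes straight to the socket; they never enter this process's heap.
        with open(path, "rb") as f:
            size = os.fstat(f.fileno()).st_size
            offset = 0
            while offset < size:
                # Returns how many bytes the kernel actually moved.
                offset += os.sendfile(conn.fileno(), f.fileno(),
                                      offset, size - offset)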

