Post-incident review on the Atlassian April 2022 outage (atlassian.com)
253 points by johnmoon on April 30, 2022 | 152 comments



I sincerely hope all of the people who worked on the recovery effort are okay, and are being well supported and strongly encouraged to tend their mental health. I have no personal investment in Atlassian products—if anything, unrelated to this or any incident, I could happily never use them again.

But the people who work there are human, and I know what kind of a toll a protracted recovery effort can take. I know it from chronic pain two years after the burnout started. No one should experience that.

I know it’s probably not the thing Atlassian the business thinks they should communicate to customers/general public, and it’s not surprising it’s absent, but it’s a shame they did not address the well being of their staff. Especially those who undoubtedly put exhausting effort into recovering from this incident.

And sure, that would be unusual as a business communication. But should it be? Even before this severe burnout, I would have looked at this list of lessons learned and thought “a lot of people are going to be overworking the same as they’ve been through all of this, but now they’ll be doing it invisibly.” And that definitely doesn’t inspire confidence that there won’t be another catastrophic mistake in the near future—it doesn’t make me more confident in the business.

None of this is a strong criticism, just some observations from being on the other side of a marathon incident recovery.


As someone who's worked on similar situations, I don't expect any pat on the back, but see it as my responsibility to make sure it doesn't happen in the first place, and when it does, my responsibility to fix it without complaint.

Some Atlassian customers might have had much more severe (mental health and other) problems than the Atlassian staff, so I think it can be perceived as mildly solipsistic to be praising the staff and (being perceived to) forget about the customers' wellbeing (I know you didn't say anything about forgetting about the customers, but affected customers may perceive it that way).


Speaking as someone who is no longer involved at the coal face of things like that, my role in an equivalent incident resolution would be to attend some of the 3-hourly calls (this frequency sounds extreme past the first 3-4 days, btw), and therefore, having nothing to lose from siding with you on this, I’ll disagree. Accountability would be mine; responsibility to fix would be the team’s. I.e. the team needs many pats on the back, as they break their backs to solve something the organisation has done wrong, such as, in this case, not implementing proper controls on the processes. It’s never about the running code, it’s about how it ended up in production. People make mistakes, are assigned work outside their comfort zone, are junior. And these failure modes are also operational growth modalities. But the processes need to support that growth, and focusing on debugging the issue is necessary for resolution but wrong once the focus moves to long-term avoidance.

You can praise staff but cannot praise the customer, unfortunately, as that would be inappropriate. You can work on the relationship, of course, by not limiting your reconciliation to contractual credits on the SLA. Those will mean nothing to the customer staff who took a significant part of the hit in work stress. Not sure what would be a good way to regain trust at that level.

Edit: fixed accountability vs responsibility in wrong order


I think that’s a false trade-off; you can empathise with both employees and customers at the same time. And while sure, maybe a lot of the Atlassians fixed the issue “without complaint” as you said, it can still be a highly stressful situation (probably even leading to burnout), and I hope they are looked after even if they don’t speak up.


I find it hard to imagine mental health being affected negatively by atlassian services disappearing.


I find it exceedingly hard to imagine how it wouldn’t. Think of how people use their products, and the cascading effect of that use being disrupted with long term interrupted access to information that was probably stored only there. Think of whole organizations scrambling to recreate knowledge that’s only partially in their heads.


This is a fascinating response. I’ll address the last part first, because I don’t want it to get lost.

> I know you didn't say anything about forgetting about the customers, but affected customers may perceive it that way

No, I didn’t mean to suggest or imply this, and hope it won’t be taken this way by anyone. It was perhaps a mistake taking it as read that obviously their customers were also harmed. But I do think it’s a reasonable point to make more explicit, and I do agree that it likely had a similar impact on customers to the experience I described. I’ll repeat that no one should experience that.

That said…

> As someone who's worked on similar situations, I don't expect any pat on the back, but see it as my responsibility to make sure it doesn't happen in the first place, and when it does, my responsibility to fix it without complaint.

This seems like you’re addressing something entirely outside the actual content of my comment, and perhaps projecting your own priorities onto it.

First of all, I was in no way suggesting anyone get special reward. And I was in no way referencing any Atlassian employee’s complaints nor airing my own. I mean it sincerely that I hope they are okay and that their mental health needs are being respected. It was entirely a statement of human compassion, and an observation that it’s one which can go unstated/understated in these discussions.

Secondly, I added my own experience as a personal reflection on the toll it can take. It’s not easy to say in a public forum that I’ve suffered years of chronic pain after addressing an incident. I added this for context because I think it is easy for people to dismiss the impact serious incidents have on the people responsible to them.

A side note before I get to thirdly: I think this is also true for people in many careers where incident response is a primary job responsibility. Sometimes it prompts explicit acknowledgement, often it does not. I just think the world would be better off if more people’s legitimate pain and challenges were acknowledged.

Third point: I took special care above to say “responsible to”, not “responsible for” or simply “responsible”. I’m certain, given Atlassian’s size and the impact of this outage, that many of the people involved in recovery efforts played no role in the mistakes leading to the outage.

In my own anecdote, I played no role in causing the incident. I have to walk a fine professional line here because I have no intention or desire to criticize anyone else involved. I think I can reasonably say this. I did what you described:

> see it as my responsibility to make sure it doesn't happen in the first place, and when it does, my responsibility to fix it without complaint.

Even so, I’m experiencing chronic pain years later. And now I will complain: it sucks being in pain every day for years. I wouldn’t go back and do anything differently as an IC, except perhaps to tell past me when to slow down and that I have more ability to affect short term prioritization than I once realized.

Lastly,

> praising the staff

I sincerely hope this hasn’t had the negative impact on you which you’re describing as hypothetical for Atlassian customers generally. If it has, I hope you’ll hear me when I say any sharp tone in this response is not from lack of solidarity.

But if it hasn’t and you’re replying only out of work ethic and blame-placing: kindly go re-read my comment and recognize that the only praise I expressed was for the humanity of human people whose experience I wish to include in the conversation.


Thanks a lot for sharing the experience, especially here in a public forum. Not an easy thing to do. I have only pure respect for you, and am grateful for you sharing the experience. Sorry for my lack of clarity; I intended my comment to be about Atlassian and the situation in general, and definitely not to downplay anyone's specific situation.


I was thinking of this - working on this incident sounds like an exhausting amount of work for so many people.


[flagged]


"what about"-isms don't remove the experiences of people with jobs less stressful than rail clearing specialists.


I'm curious about your take on the cheapening or dilution of words in language nowadays. One isn't hurt but traumatized; one isn't "once bitten, twice shy" but suffering from PTSD. It's assault if one gets in a fight, and so on. I sometimes feel we're running out of language to distinguish the truly extreme from the quotidian. Same thing with how GP phrased it: a hard week of work comes across as having gone to battle and been left with deep wounds.

Still, I do mostly get your point.


You’re projecting a whole lot of language that’s not in use here so maybe you should consider that.


> nowadays

Any argument that uses this qualification should be accompanied with data.

You haven't specified anything, not even a rough time frame. Your statement is so vague nearly anything can be projected onto it.

Also a sidenote: if you're going to make an argument about language and definitions it seems like a good idea to actually know what PTSD is before using it as an example.



My comment was vague because my feelings on the topic itself are vague. I'm not married to the position and was only trying to spark a line of thought.

That said, "nowadays" would be since the mass adoption of social media. Circa 2000, for the sake of the thought experiment.

Also, what is the definition of PTSD, and how did I use it incorrectly? For context, I once dropped a heavy-duty suitcase on my foot when the handle gave. Now I always picture the worst when carrying it. Is that PTSD? I would suggest not, because the extent of the trauma is much lower than a war veteran's. That was what I was trying to get at.


> Is that PTSD?

No one here who’s qualified to answer that is going to give you a definitive answer. At best you’ll get pattern recognition from either patients or practitioners nudging you towards consulting a professional.

> I would suggest not, because the extent of the trauma is much lower than a war veteran. That was what I was trying to get at.

You probably shouldn’t speculate about PTSD. You seem to have preconceived notions about what qualifies that aren’t consistent with actual people who experience it. I won’t speak for the people in my life who do, but few of them have ever been to war.

If you want to know more I sincerely encourage you to speak to a professional. They’ll have much better insight than any questions on HN.


Okay you’re actually just being a jerk. I’ll take my two years of chronic pain and go live for that excitement. But you can go take your gob’s sakes and shove them.


While the post-mortem is thorough, it misses key details on what the companies unlucky enough to be caught out by this outage experienced. For example, it fails to mention that impacted customers lost access to certain Atlassian services (JIRA, Confluence, OpsGenie) for up to ~2 weeks, but not to others like Trello or Bitbucket.

Of these, losing access to OpsGenie for this long was a massive problem, dwarfing most other systems. OpsGenie is like PagerDuty in the Atlassian world.

I spoke to several engineers at impacted companies who could not believe their incident management system was “deleted”, that there was no ETA on when it would be back, or that Atlassian could not prioritise restoring this critical system ASAP. JIRA and Confluence being down was trouble enough, but those systems being down for some time was something most teams worked around. However, suddenly flying blind, with no pager alerting for their own systems? That is not acceptable for any decent company.

Most I talked with moved rapidly to an alternative service, building up oncall rosters from memory and emails, as Confluence, which stored these details, was also down. Imagine being a billion-dollar company suddenly without a pager system: no ETA on when that system would be back, and your vendor not responding to your queries.

I talked to engineers at one such company, and it was a long night moving rapidly over to PagerDuty. It would be another 7 days before they could get through to a human at Atlassian. By that time, they were a lost customer for this product. Ironically, this company had moved to OpsGenie from PagerDuty a few years before, because OpsGenie was cheaper and they were already on so many Atlassian services.

The post-mortem has no actions on prioritising services like OpsGenie in reliability or restoration, which is a miss. I can’t tell if Atlassian staff are unaware of the critical nature of this system or if they treat all their products - including paging systems - as equals in terms of SLAs on principle.

Worth keeping in mind when choosing paging vendors - some might recognise that these systems are more critical than others.

I wrote about this outage from the viewpoint of the customers as it entered its 10th day, and it was discussed on HN, with comments from people impacted by the outage. [1]

[1] https://news.ycombinator.com/item?id=31015813


As a customer I would not buy or use an Atlassian product in a thousand years.

14 days without pagers ...

From friends and people I know, I have heard nothing good about Atlassian products. And that was before this 2-week downtime.

It looks like the products are held together with duct tape, spit and a little bit of dirt.


> 14 days without pagers

There’s an interpretation of that where life is great


It should be shouted from the rooftops: don't switch services just to save money if the result is potentially worse business outcomes. Why save a tiny bit of cash if it puts your business at risk?


The saying "Cheap is expensive" comes to mind


Also "don't put all your eggs in one basket"

This is not very far away from AWS hosting their own status pages.


Presumably no one who made that decision is fucking stupid and thought they were putting their business at risk. Good lord.

I wasn't affected by this in the slightest and just found out opsgenie exists from the parent comment but even I can understand that this decision would almost certainly be driven by things like "we're already using atlassian for everything else and will benefit from the interop" and "we already trust them with everything else and they haven't let us down or we wouldn't be using them for any of that stuff either."


There's no need to get that upset over it...

When we switched to OpsGenie it was also to consolidate billing with other Atlassian products, but we all knew PagerDuty worked better and that OpsGenie was still pretty new and rough around the edges. We certainly didn't gain any business advantage by switching, but we did have to do a lot of work to switch which took away from other things we needed to get done. But ultimately we had no say because somebody just wanted to save a little money. I also doubt that there was a thorough vendor assessment before we picked it up, since it was a vendor we already used.


In my experience, the decision is driven by a VP who wants a bullet point on their year-end review, namely "notional cost savings achieved".


In my experience there isn't much interop that's worth anything with OpsGenie. It's a different story with JIRA/Confluence.


But you don’t need a paging system if your services don’t go down. Isn’t the fact that you can’t manage without one for two weeks an indictment of your own practices?


Well said - you don't need to write tests for your code if you don't write bugs!


That’s not equivalent. Code changes every day (potentially quite extensively), but the same is not true for infra, which hopefully stays mostly the same from day to day.

If your incident reporting is down, hopefully you completely stop changing anything about your infra.


It's naive to expect that things will just stay fine and dandy if you "stop changing" your infra. Consider scenarios such as traffic spikes, network outages, and under-provisioned resources.


Sure and you don't need emergency locator transmitters if your aircraft doesn't crash.

When you're ready to prove that your services "don't go down" send me an email and I'll come work for you.


Even if they do, I wouldn’t want to work for them


I can't say that I've ever been a fan of Atlassian or their products, but this blog post makes it sound like they've at least learned the right lessons from this:

1. Establish universal "soft deletes" across all systems.

2. Better DR for multi-site, multi-product incidents.

3. Fix their incident-management process for large-scale incidents.

4. Fix their incident communications.

Regarding #4 in particular:

"Rather than wait until we had a full picture, we should have been transparent about what we did know and what we didn't know. Providing general restoration estimates (even if directional) and being clear about when we expected to have a more complete picture would have allowed our customers to better plan around the incident....

[In the future], we will acknowledge incidents early, through multiple channels. We will release public communications on incidents within hours. To better reach impacted customers, we will improve the backup of key contacts and retrofit support tooling to enable customers... [to make emergency] contact with our technical support team."


I tend to agree.

However, there's one point that makes me skeptical: there are no organizational changes, or changes to leadership, or anything in that direction.

This sounds like "the tech guys screwed up, culture and management is fine here". Which it might be, or it might not.

I would have loved to see

5. We will stop pushing customers so hard towards using our cloud

for one, but that wouldn't be convenient for Atlassian.


Note that the ToS also forbid Cloud users from “disseminating information” about “the performance of the products”.

So you can’t say it’s slow or underperforming.


I thought, in terms of Jira and Confluence at least, it was just accepted that being slow and underperforming was the status quo and if it was running at a speed you'd consider normal, that's an exception (and cause for alarm... like "did that actually save or is there a silent JS error not being displayed?").


Sure, it’d be extremely hard for their cloud offering to be slower than our on-premise installation.


Well that's some Oracle-tier shit.

(ref: https://danluu.com/anon-benchmark/)


Not just benchmarks, Oracle also sues (or threatens) security researchers, vendors, and of course customers for less.


Is that real? That is so bad. Cue the next “shall not disclose or notice our incompetence” clause.


https://www.atlassian.com/legal/cloud-terms-of-service

> 3.3. Restrictions. Except as otherwise expressly permitted in these Terms, you will not:

> (i) publicly disseminate information regarding the performance of the Cloud Products; or (j) encourage or assist any third party to do any of the foregoing.


That should tell you everything you need to know about atlassian as a company and about how good their cloud product is.


I believe these clauses are usually there to prevent competitors from using (sometimes misleading) benchmarks in their advertising.

It used to be common to see this kind of comparison between databases, for example.


Link for reference: https://www.atlassian.com/legal/cloud-terms-of-service

> Except as otherwise expressly permitted in these Terms, you will not:

> ... (i) publicly disseminate information regarding the performance of the Cloud Products;


By keeping the management you have a chance they learn a lot from that unique experience. Changing leadership would be the PR move.


You can change leadership, but if culture is the problem, then you're out of luck.


I wouldn’t expect the larger question of what organizational and management issues led to the problem to be made public. At least not while the issue is still fresh. Maybe down the road as a business school case study after some of the players retire.


I am actually pleasantly surprised at the openness of this response, and their taking responsibility for mistakes and detailing what will change in the future. It's not just corporate speak. I think that speaks well for the company, and it improved my view of them.


Why do you assert that this is not just corporate pr?


Of course, this is a part of public relations for a corporation.

But usually when people talk about "corporate PR" they mean weasel words, non-committal statements, and deflecting blame. I think this PIR does a decent job of acknowledging the mistakes they made and how they can improve.


> During this incident, we missed our RTO but met our RPO.

You missed your recovery time objective by ~2 weeks and did not properly communicate the issue from senior management until around a week into the outage. It is great to hear about how the company plans to do better, let's see how the next outage improves things.


Noting that they met their RPO is technically correct but tone-deaf.

Nobody cares that you lost very little customer-submitted data if customers rely on you (mission-critical) to continue to accept data, and the outage prevented that.


That's simply not true, it's just not good enough - if you lost customer data as well, people would be livid.


I can only imagine the way the person who pulled the trigger on the deletion script felt the moment they realized what had happened.

I’ve been there with much less significant incidents when a “routine” change turned into a potentially resume generating event. It’s not fun.

Ultimately the responsibility is with the organization that made it possible for an event of that scale to happen rather than the individual person who happened to trigger it but that doesn’t make it feel any better.


As someone who observed this particular incident from the inside (holy shit-balls have the last 3 weeks not been fun), one of the few positive elements of it has been the universal and effectively instinctual agreement internally that it was a massive screw-up in the system that we all have to own, rather than one or several individual screw-ups that need to be put at the feet of individuals.


Well, the one constant of IT is 'shit happens'; mark this one down as something interesting you've seen.


Thomas J. Watson famously said:

Recently, I was asked if I was going to fire an employee who made a mistake that cost the company $600,000. No, I replied, I just spent $600,000 training him. Why would I want somebody else to hire his experience?


This quip never made much sense.

Mistakes almost certainly follow some kind of pareto distribution instead of it being evenly distributed.

That's why 2% of doctors are responsible for 39% of malpractice lawsuits. If your doctor cut off the wrong leg and you sued (and won) $2 million, would you go back to the same doctor and just chalk it up as "well, now he's got $2 million of training"?


> This quip never made much sense.

I see it as: Your employee just learned a valuable lesson, and you definitely don't want to hire another new employee to make that same mistake again.

Ultimately, it's the owner taking responsibility for the fuck up because they hired the employee, but still standing behind their decision to hire the employee.

If your employee is an idiot and makes up 39% of the mistakes, of course you won't keep them around when they make a big one. But most employees are not fuck-ups, as you alluded to with 2% of doctors accounting for a disproportionate share of malpractice suits.


Lol, so you should keep an idiot who shot off a (looks at notes for what costs as much as $600k in the army) Javelin missile while drunk? It’s prudent to apply the concept of nuance. If some intern deleted all these accounts, you should fire the person responsible for letting things be loose enough for the intern to do it, but you should fire the intern too. Whatever lesson was learned could be learned from a wiki page detailing the incident.


I was on call “back in the day” when a customer rang to tell me that she’d meant to type:

    # find /tmp -exec rm {} \;
But instead she’d typed

    # find / tmp -exec rm {} \;
I learned a lot about Unix (and, as it happens, X Windows) file systems that week. Note that this was long before tools like sudo existed, when cleaning up /tmp was often a manual process.

There is no way we could blame a person for a single keystroke error - it’s crazy that the command worked - but I now look at the find command with an enormous amount of fear.


I ran into a similar one a few times from the same team that kept doing things in their scripts like

  rm -rf ${basedir}/*
without setting

  set -u
to exit if there are undefined variables. I suggested a few other sanity checks but was told I was adding friction.


Good old days deleting an entire server ... learned that the hard way.


I think in this instance it’s the person who sent the IDs that feels worse. The deleters were provided a list of IDs in which 30 were correct app IDs and the rest were site IDs.

Like they mentioned, they had a delete script that worked for all types of unique IDs, so that can hopefully also dilute the feeling of ”it’s all my fault”.


That moment when you see DELETE 18388272773 0


Reasons why I don't start writing queries without opening a transaction first. ABORT, good sir, ABORT.


Oooof. Passing in Application IDs will delete applications and passing in site IDs will delete sites. That's a really really bad design. I'm bookmarking this so that I can use it as a showcase going forward.

Just this week, I changed a spec in one of our proposed endpoints that did exactly that. We passed in IDs of various types of objects to perform actions, and I changed the API so that callers are forced to pass in a struct that contains an object type and an object ID. Explicitness is so much safer in the long run, especially in enterprise apps.


This reminds me of how useful it is to make your IDs include the type of the thing they're identifying as part of the value. This could be as simple as something like "u<random digits>" for users, "p<random digits>" for projects, etc. Or it could be a full-blown URN scheme like some of the big cloud providers use (though please, whatever you pick, BE CONSISTENT about using that form and only that form. If there's an operation to extract "just the id" from your larger value, and you need to use the full thing in some places and "just the id" in others, then you've missed the point).
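
Roughly, the kind of guardrail this enables (a Python sketch; the prefixes and helper names here are made up, not any particular vendor's scheme):

    import secrets

    # Hypothetical prefix registry: the kind travels with the value everywhere.
    PREFIXES = {"user": "u_", "project": "p_", "site": "site_", "app": "app_"}

    def new_id(kind: str) -> str:
        """Mint an ID whose kind is readable by humans and checkable by machines."""
        return PREFIXES[kind] + secrets.token_hex(8)

    def expect(kind: str, value: str) -> str:
        """Refuse to proceed if an ID of the wrong kind is handed in."""
        if not value.startswith(PREFIXES[kind]):
            raise ValueError(f"expected a {kind} id, got {value!r}")
        return value

    site_id = new_id("site")     # e.g. 'site_9f86d081a3c4b2aa'
    # expect("app", site_id)     # raises ValueError instead of quietly deleting a site

The check itself is trivial; the win is that it becomes cheap enough to put in front of every destructive operation.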


Yep, this was a situation where humans manually passed IDs between each other, so it would have been great if those IDs were maximally human-readable.

Crockford’s base32 encoding with a domain prefix works well enough.


And when it comes to user interfaces for deleting, have it repeat back to the person how many, and what is being deleted.

And have them type out both the number and the thing they are deleting.

https://rachelbythebay.com/w/2020/10/26/num/
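
Something like this, roughly (Python; the prompt wording is just an example of the idea, not any real tool):

    def confirm_deletion(kind: str, items: list) -> bool:
        """Repeat back what will be deleted and make the operator type it out."""
        print(f"About to delete {len(items)} {kind}(s):")
        for item in items[:10]:
            print(f"  - {item}")
        if len(items) > 10:
            print(f"  ... and {len(items) - 10} more")
        answer = input(f"Type '{len(items)} {kind}' exactly to proceed: ")
        return answer.strip() == f"{len(items)} {kind}"

    # Typical use: abort unless the operator consciously types the count and the kind.
    # if not confirm_deletion("site", ids_to_delete):
    #     raise SystemExit("aborted")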


Do any databases have something like native support for soft deletes or the ability to undo (other than SQL transaction rollbacks, where you have to specify the undo checkpoint)? Something like what Git does where it keeps a history of edits? If this isn't common, is this a neglected area that should be addressed, or is it just too hard a problem? It feels like with SQL there are minimal guardrails and it's just your own fault if you're not extra careful, compared to, say, using Git with code or using "restore from trash" with filesystems.


So, the fully fleshed out form of this in databases is usually called Bitemporality. The official SQL standard has included this for some years now, but it's not widely implemented by databases.

An intuitive way to think of bitemporality is that it's like MVCC, but with 4 timestamps per row version. One pair describes a range of time in "outside" or "valid" time, i.e. whatever semantic domain the database is modeling; the other pair describes a range of "system" time, which is when this record was present in the database. This lets you capture and reason about the distinction between when a fact the database models was true in the real world and when the database was updated to reflect that fact (some people call this "as of" vs "as at"; the terms here aren't fully settled, but the basic distinction is there). So you can revise history, do complex time-travel queries, all sorts of stuff. It's a very useful model that directly aligns with the sort of questions businesses need to answer in the context of a court case, or when revising their ground source of truth due to a past bug or error.

The downside is that your database balloons with row versions, and many queries become far more complicated, perhaps needing additional joins, etc. Also, from the perspective of database implementors there's a ton more complexity in the code. So that's why it's not widely supported despite the standard.

There's also a niche of databases built around this model from the ground up, usually based on Datalog instead of SQL. There's also overlapping work with RDF and Semantic Web thinking (as awry as all that went).

In practice how most organizations address this is operationally, by keeping generational and incremental backups that let them restore previous database states as needed. Though as the original post we're here for proves, that kind of operational solution can bite back hard when it goes wrong.
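
A toy sketch of the four-timestamp idea, purely to illustrate the model (Python, in memory; real bitemporal databases do this in the storage engine and query planner):

    from dataclasses import dataclass

    FOREVER = float("inf")

    @dataclass
    class RowVersion:
        key: str
        value: str
        valid_from: float   # when the fact became true in the modeled world
        valid_to: float     # when it stopped being true
        sys_from: float     # when the database learned of this version
        sys_to: float       # when the database superseded it

    def as_of(rows, key, valid_time, system_time):
        """What did we believe at system_time about the state of key at valid_time?"""
        for r in rows:
            if (r.key == key
                    and r.valid_from <= valid_time < r.valid_to
                    and r.sys_from <= system_time < r.sys_to):
                return r.value
        return None

    # A correction recorded at system time 200 about something true since valid time 50:
    rows = [
        RowVersion("acct-1", "active",  0,  FOREVER, 0,   200),
        RowVersion("acct-1", "deleted", 50, FOREVER, 200, FOREVER),
    ]
    print(as_of(rows, "acct-1", valid_time=60, system_time=100))  # 'active'  (what we believed then)
    print(as_of(rows, "acct-1", valid_time=60, system_time=300))  # 'deleted' (what we know now)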


In MVCC systems like PostgreSQL, if you don't vacuum (garbage collect the old tuples), your database is append-only and you can query as if your transaction was started at some time in the past. I don't know how to set auto-vacuum to have a fixed delay, e.g. keeping 24h of changes, but I bet it can be added if it's not built-in.


How do you set the time back for a query? How do you specify what txid to use for a point in time query?


This would just be equivalent to rollback to a checkpoint in time of the whole table. I think the question was more on a row-level.

If you are just interested in global time traveling, there are many solutions, such as replaying the oplog from snapshots in time or delayed replication.


Maybe I'm misunderstanding but if you know the txid at the time you're looking for then you can find the value of any specific row at any point in time using xmin and xmax at the row-level (if you're not running vacuum, as the parent suggested).

Am I mistaken?

The only problem is that you need to keep a map of timestamps to txids so you can find the txid that was valid at a particular moment in time. This doesn't sound to me like a significantly difficult problem, but maybe I'm mistaken. That said, it's not like you need super high time resolution for the use case in question.


Lots of people have updated_at time stamps on all tables, so you could probably inspect those to find your way back. I’ve never tried to query the history implicitly hidden in Postgres tables so I’m not sure how possible (or sensible) any of this is.


I don’t understand why you’re being downvoted and I hope someone will explain.


Datomic treats all data as immutable, so it can wind back to any version.

When new data is written, the entire block is copied and rewritten rather than changing the data in-place.


This is similar to SQL checkpoints in that a rollback is all or nothing. It wouldn’t be segmented by tenant unless each tenant has its own transactors.


The difference is that, unless you’re using a SQL database with direct/extended support for write-once immutability or a data model designed for it, once the transaction exits, every successfully committed change is ~final.

A write-once immutability design (whether in the DB itself, or an extension, or a userspace implementation) lets you reconsider and rework mistakes after they’re committed, not unlike how you might do with git rebase etc.


That would be difficult to use given requirements like GDPR's user data deletion.


Real world immutable systems usually support the eventual garbage collection of soft-deleted or superseded records. No one actually wants disk usage to grow unbounded forever. It just means you are not going to support "overwrite immediately" semantics in ordinary application code paths.

Datomic describes its capability here: https://docs.datomic.com/on-prem/reference/excision.html


Dolt is a SQL database that implements Git primitives at the storage layer, including commit, diff, and revert. We wrote a blog the other week summarizing how these features can serve as guardrails:

https://www.dolthub.com/blog/2022-04-14-atlassian-outage-pre...


Snowflake has time travel[1] keeping the original data for specified period of time.

1: https://docs.snowflake.com/en/user-guide/data-availability.h...


Generally a database that allows you to revert only allows you to revert the entire database. Bits of information are often related to other bits of information, and if you only undo some of the changes, then you end up with inconsistent data (dangling references) and destroyed data (overriding information more recent than what you are restoring). So soft deletes tends to become a domain specific problem rather than a problem that can be solved with better tech.

I'd argue that SQL has some pretty big guardrails. 'DELETE FROM CUSTOMER WHERE' will generally fail, because there is lots of data referring to the CUSTOMER table and the system will insist the data remain consistent.


WAL archiving / point in time recovery can help with this.


This is rather easily achieved using additional tables and triggers, so that whenever you update or delete a row, the old version is written to the additional table. It requires some work (writing triggers, duplicating table definitions, etc.), but it's not that hard and the result is worth it.
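
A minimal version of that pattern, using SQLite triggers via Python's sqlite3 (the table and column names are made up for illustration):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, email TEXT);

        -- Shadow table holding every superseded or deleted row version.
        CREATE TABLE customer_history (
            id INTEGER, name TEXT, email TEXT,
            change_type TEXT, changed_at TEXT DEFAULT CURRENT_TIMESTAMP
        );

        CREATE TRIGGER customer_audit_update BEFORE UPDATE ON customer
        BEGIN
            INSERT INTO customer_history (id, name, email, change_type)
            VALUES (OLD.id, OLD.name, OLD.email, 'UPDATE');
        END;

        CREATE TRIGGER customer_audit_delete BEFORE DELETE ON customer
        BEGIN
            INSERT INTO customer_history (id, name, email, change_type)
            VALUES (OLD.id, OLD.name, OLD.email, 'DELETE');
        END;
    """)

    conn.execute("INSERT INTO customer VALUES (1, 'Acme', 'ops@acme.test')")
    conn.execute("DELETE FROM customer WHERE id = 1")
    print(conn.execute("SELECT * FROM customer_history").fetchall())
    # [(1, 'Acme', 'ops@acme.test', 'DELETE', '...timestamp...')]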


Quite a few databases support time travel queries, in particular Oracle has for years and CockroachDB has them also. We can query the state of a table as it was at any point in the last 72hrs.


I am very curious if they used a Jira board during this crisis for issue tracking. Because then they would have more than 4 lessons learned.


What you're basically suggesting is that feature development at Atlassian moves at such a glacial speed because of course they're using Jira to manage it. This kind of blows my mind right now.


I chuckled hard hahaha


From the article:

"To manage the restoration progress we created a new Jira project, SITE, and a workflow to track restorations on a site-by-site basis across multiple teams (engineering, program management, support, etc). This approach empowered all teams to easily identify and track issues related to any individual site restoration."


I read "empowered" there and thought, what a ridiculous misstep. They can't even stop themselves from selling their product in a document that's supposed to be a mea culpa, where you can guarantee that many of the readers are pissed off and absolutely not receptive to your marketing.


As I said in my other down-voted comment - and yours is proof of the same - these post-mortems are first of all PR, and only second everything else. And lots of people have pointed out why they don’t constitute invaluable knowledge: because they lack the really important stuff, or substitute marketing messages for it…


Good thing they acquired Trello.


Implementing soft deletes is a lesson every developer learns early in their career. The fact that Atlassian did not implement that in their cloud is mind-boggling.

Great case study of a monumental fuck up!


So many opportunities missed to avoid this! Look at one ID and ensure it is what you expect it to be. Run the script in a dryrun mode. Run the script for 1 customer. Probably more!


This was addressed in the write-up (it’s very long, so missing it is easy). They ran the script against 30 accounts first to verify it worked, and it did, because the list of 30 IDs they tested against came from a different source than the other 750-ish. It’s a shitty mistake to make, but I’m certain I’ve made similar ones.


One of the favorite tricks I've ever seen is how Twilio uses human-readable prefixes[0] on their various identifiers - you will never mistake a device (HSxxxxxxx) for an account (ACxxxxxxxx). It's prevented us (a Twilio customer) from making similar mistakes in the past.

[0]: https://www.twilio.com/docs/glossary/what-is-a-sid#common-si...


I like the idea of human-readable identifiers. But it generally feels like this class of error could be prevented with more type safety in the API and data model? Like deleteDevice(123) and deleteAccount(123), rather than delete(123). This is how REST is designed; the type of resource is already baked into the URL.


Those go hand-in-hand. Tell me if deleteDevice(123) or deleteAccount(123) is wrong, as opposed to deleteDevice(A123) and deleteAccount(D123).


And don’t make a “universal delete” script in the first place…


So, what do you think of terraform?

It's puzzling to me that we, as software developers, spend so much effort trying to automate such one-off deletion tasks, and the automation inevitably goes wrong and results in data loss.


After GDPR almost all companies need to have something like a “universal delete”. There are safer ways to deal with data retention policies but I can understand why such a script exists.


For large systems, adding soft delete would be more of a PR move than actual work. In some projects I know of, implementing soft delete is as hard as writing the project from scratch. Having scripts and manual restoration for particular records is the way those businesses communicate with customers.


Soft deletes are great, but are probably not sufficient to meet the GDPR's "right to be forgotten"


You can do a soft delete: set a flag (an expiry date), have apps ignore records with an expiry date set, and then have a scheduled job hard delete all items whose expiry date is less than the current time.

Alternatively, move the deleted data to a temporary location and then delete the temporary location after a short period of time.

Or, better, combine both patterns, where expired rows get moved to a temporary location before being hard deleted a period of time later.

GDPR says you have to delete data when requested. As far as I know, you have 30 days to acknowledge the request and up to 60 days to action it. It’d be completely reasonable to do a soft delete for 7-14 days before doing the hard delete to prevent these kinds of errors.
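
A sketch of that flag-plus-scheduled-purge pattern (Python with SQLite; the 7-day window and the column names are assumptions for illustration, not anything from the PIR):

    import sqlite3, time

    GRACE_SECONDS = 7 * 24 * 3600  # assumed grace window before the hard delete

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, name TEXT, deleted_at REAL)")

    def soft_delete(account_id):
        # Flag only; nothing is destroyed yet, so a mistake is still reversible.
        conn.execute("UPDATE account SET deleted_at = ? WHERE id = ?", (time.time(), account_id))

    def restore(account_id):
        conn.execute("UPDATE account SET deleted_at = NULL WHERE id = ?", (account_id,))

    def live_accounts():
        # Application queries simply ignore soft-deleted rows.
        return conn.execute("SELECT id, name FROM account WHERE deleted_at IS NULL").fetchall()

    def purge_expired(now=None):
        # Scheduled job: hard delete anything past the grace window, within the GDPR deadline.
        cutoff = (now or time.time()) - GRACE_SECONDS
        cur = conn.execute(
            "DELETE FROM account WHERE deleted_at IS NOT NULL AND deleted_at < ?", (cutoff,))
        return cur.rowcount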


That's exactly what we do: a soft delete followed by a scheduled hard delete 7 days later. The customer is also notified that they won't be able to restore their account after that. Soft deletes are usually automatic and well-tested (for example, a subscription expired) but sometimes there are requests by management or customers to delete accounts manually and this approach really helped avoid disastrous situations like the one at Atlassian because you have a whole week to realize you deleted the wrong accounts (customers will most likely complain much sooner).

Yes, GDPR has a grace period of 30 days or so, it's never been a problem in practice.


Does this apply for B2B companies like Atlassian?


Yes. As I understand it, GDPR does not care if customer data belongs to a business or an individual. And even if it did, business data will likely include PII for employees, which would need to be deleted.

GDPR does give you a grace period, so you can soft delete immediately and then hard delete after some period shorter than the GDPR deadline. However, actually implementing such a system can be rather difficult and potentially expensive.


There's no reason you can't have both. Use soft deletes for everything except for a formal GDPR right to be forgotten request (or any other compliance situation).


Soft deletes are just the first step. After some period of time you can automatically purge or manually purge. Which is what they’ve supposedly committed to doing.


> except for a formal GDPR right to be forgotten request (or any other compliance situation)

That "any other compliance situation" includes things you probably want to have a soft delete for, like deleting/closing an account. Customers accidentally deleting their accounts, then wanting it restored happens more frequently than one would hope.


For anyone wanting to attack:

* hindsight is 20/20 - it's easy to lecture after the fact about how mistakes could have been avoided

* modern software systems are very complex

* have you never made a mistake?

It is however noteworthy that they have done the wise thing from a publicity perspective, which is post this on a Friday, hoping that by the next tech cycle more interesting things will have happened for the press to report than a rehash of this outage. That's politics.


>The script that was executed followed our standard peer-review process that focused on which endpoint was being called and how. It did not cross-check the provided cloud site IDs to validate whether they referred to the Insight App or to the entire site, and the problem was that the script contained the ID for a customer's entire site.

Yup that deletes something… anyway…

> Establish universal "soft deletes" across all systems.

It’s just easier that way to observe what might happen.


Soft deletion always feels at odds with privacy-related "right to have data deleted" laws.

Would be super interested in a technical writeup on how they do this.


"Right to have data deleted" can be 'circumvented' if the data is critical part of the system or is needed for legal purpose (for example it can be mandatory to keep 1 year of IP logs and data associated with it)

In previous companies I have worked for, we did instant soft-delete, then hard anonymisation after 15-30days and then hard delete after a year. That means the data was not recoverable for customer but could still be recovered for legal purpose.


There's a time window within which you need to permanently delete the data. A soft delete allows you to delete the data quickly and see what happens. If everything is okay, you can then purge your database of all soft-deleted data.


It shouldn’t be. These laws at least have the nuance to understand that data can’t be immediately deleted from backups, and that in instances where deletion is complicated the customer is notified.


IANAL but the laws have carve outs for backup retention, etc.

A simple technical solution is to store all data with per user encryption keys, and then just delete the key. This obviously doesn't let you prove to anyone else that you've deleted all copies of the key, but you can use it as a way to have higher confidence you don't inadvertently leak it.
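
A toy illustration of that key-deletion ("crypto-shredding") idea, assuming the third-party cryptography package and an in-memory store:

    from cryptography.fernet import Fernet

    keys = {}    # small, separately managed key store
    blobs = {}   # encrypted customer data; copies of this may also live in backups

    def store(user_id, plaintext: bytes):
        key = keys.setdefault(user_id, Fernet.generate_key())
        blobs[user_id] = Fernet(key).encrypt(plaintext)

    def read(user_id) -> bytes:
        return Fernet(keys[user_id]).decrypt(blobs[user_id])

    def forget(user_id):
        # "Delete" the user by destroying their key: every ciphertext copy,
        # including the ones sitting in old backups, becomes unreadable.
        keys.pop(user_id, None)

    store("u_42", b"ticket history")
    forget("u_42")
    # read("u_42") now fails: the key is gone even though the blobs remain.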


Ideally they'd encrypt the customer content with a key provided by the customer and destroyed when the customer requests account deletion. The customer would still be able to use their key to decrypt backups that they get prior to the request. If the customer changes their mind, they just upload the key again (along with the backup, if necessary).

Of course, this means trusting Atlassian to actually delete the key on request, but there's not much reason for them not to.


Restoring data from backup is the most common data recovery technique. Lots of information there to start from if you are interested in how data recovery relates to privacy laws.


> The API used to perform the deletion accepted both site and app identifiers and assumed the input was correct

Hopefully they also change their API, so these two very different things don't use the same API call.


From my PoV that would be the most important lesson: preventing the disaster is just as important as mitigating its consequences. Good API design is cheap, so why not fix it first?


If I understand correctly, since they deleted only a small subset of all their customers, they could not restore a clean backup of those customers without losing data from other customers not impacted by the outage.

So how would one have a clean "partial backup" strategy if something similar would happen in his company?


(or her company)

Depends a lot on the situation and technologies already in use. For example, if you lost ticket sales, you might monkey patch it by having two systems at the entrance: one with the main data, one with the restored backup. If the person isn't in main, you can check backup. That could be deployed more quickly than trying to consolidate the two states.

In another situation, you could isolate customers so that a delete and a restore simply affects everything at that customer and such a partial delete is not a problem. You could still have trouble if there is a partial delete within a customer system, but restoring part of 1 company is a lot less work than restoring parts of hundreds of companies.


> The API used to perform the deletion accepted both site and app identifiers and assumed the input was correct

Yet another case where using types would have prevented a massive problem.


How do you imagine types would prevent this? No matter how I think about types in this context, I think a runtime check of "is this ID an app ID?" is mandatory in order to truly prevent this (alternatively- different URLs/parameter names to delete an app and a site)


Yes, you're certainly right that this check would have to happen at runtime (unless Atlassian have written one big project which is all checked by a compiler, which I highly doubt).

How I imagine this to work would be that the team who wrote the system (function, script, whatever) for deleting things would encode it in the types that their deletion system would only accept specifically app IDs, rather than both app IDs and site IDs (which I think reflects the specifics of their post-mortem).

In order to get a value into the deletion system for processing, the ID value would need to be parsed into the specific type that the deletion system accepts.

This is of course similar to just saying there should have been validation, but I think conceptually, validation and parsing into a narrower type are two different things. Gary Bernhardt calls this "functional core, imperative shell". Michael Feathers calls this "edge-free programming". Probably the best literature on this difference is by Alexis King, here: https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...
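
In code, the difference might look something like this (Python; the app_ prefix, names, and input list are hypothetical):

    from typing import NewType

    AppId = NewType("AppId", str)

    def parse_app_id(raw: str) -> AppId:
        """Turn an untrusted string into an AppId once, at the boundary.
        Anything that isn't an app ID is rejected here, not deep inside the deletion code."""
        if not raw.startswith("app_"):   # hypothetical prefix convention
            raise ValueError(f"not an app id: {raw!r}")
        return AppId(raw)

    def delete_app(app_id: AppId) -> None:
        # Only ever sees values that made it through parse_app_id;
        # a type checker also flags callers that pass a bare str.
        print(f"deleting {app_id}")

    ids_from_the_other_team = ["app_1a2b", "site_9z8y"]   # hypothetical input
    for raw in ids_from_the_other_team:
        delete_app(parse_app_id(raw))   # the site ID blows up here, before anything is deleted

Validation would tell you the list contains a bad value; parsing refuses to hand the deletion code anything it can't safely act on.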


> Prior to this incident, our cloud has consistently delivered 99.9% uptime and exceeded uptime SLAs.

But it had the risk of catastrophic failure the whole time. I wish we had better ways to measure and communicate risk.


That risk is always there.


So what else is being done for impacted customers? Are they getting money back? What's Atlassian's SLA for these services and what can be recouped under it?


> There was a communication gap between the team that requested the deletion and the team that ran the deletion.

Big red flag. If two teams own something, nobody owns it.


Massive props to them for releasing this but dropping it on Saturday morning seems calculated and cynical.


Post-incident reviews are really a means of (black) PR and nothing more than covert advertising. Few people, if any, care at all about these things in the busy landscape of devops and whatever. Eventually everyone is let go, because every sane software engineer knows that software works by magic on a very large scale, and outages and incidents are everyday life.


I wish they could assure us the engineer who pressed the button wasn't fired.


> Atlassian is proud of our incident management process which emphasizes that a blameless culture and a focus on improving our technical systems ...


Sure. But I hear they still only offer one type of cheese....


What apps were used to create that timeline?


Many of the projects I worked on would have a near-identical replica of an environment, from the network stack to the applications and databases. Flipping from staging to production was sometimes as simple as a DNS update. It's always an eye-opener to see businesses at this scale operating without a full replica of production in staging. It's always harrowing to test a destructive change on dummy data knowing there are a million ways a live deployment could go wrong, and the impact only scales the bigger you get, so that kind of redundancy seems even more important.

It could be argued that at the scale of a company like Atlassian this level of redundancy is prohibitively expensive - that's a lot of databases and files sitting around doing nothing 99.99% of the time - and it's hard to argue for prevention of something that's never happened and would be costly to tool up for. But you can definitely factor scaling your redundant capacity into your model, both pricing-wise and engineering-wise. It's not like Atlassian products are cheap to begin with; I'm sure they can sustain some velocity / bottom-line hit for the sake of something as basic as fully replicated staging environments. I definitely don't think this is on the engineers; it's a strategic oversight and shows where ultimate priorities lie within the company.

Putting your trust in a cloud service to take care of things you'd otherwise have to worry about yourself is a major decision, and safety is one of the top priorities of basically every user, so seeing the lack of process and the glib approach to staging is a major red flag.

Anyway that aside I do appreciate their detailed write up and it does feel like a bluntly honest and truthful disclosure. That goes a long way to restoring trust, but it does also expose some of how the sausage is made and it's clear some of the ingredients are questionable. It does bear the hallmarks of a small successful software startup hitting the big time and scaling with acquisitions faster than supporting processes can safely scale; they have a team of engineers and it's up to them where to engage them and it seems being able to do proper dry runs of destructive changes wasn't seen as more valuable than getting more services on the products page.

Hopefully they'll act on the recommendations of the report and implement the improvements they said they would and not just refocus their efforts elsewhere once the spotlight moves on. I'd like to see regular updates on this as a long-term Atlassian user as it would factor greatly into me recommending Atlassian products over other stacks in the future. They could easily set up a public Jira / Trello board so we can keep track of progress on these promises.

Obviously this is not unique, these mistakes have happened before, so it's not just the kind of stuff that seems obvious in hindsight. I am sure there were engineers highlighting these issues internally but scaling redundancy is never as sexy as onboarding a new product and adding its customers (and revenue) to your quarterly reports. Hopefully the reputation hit is a stark reminder to the c-suite that yes, they are running a technology company, and that means that technology and engineering should be just as important as growth and penetration.

Anyway, good on them for being open. Well done to the engineers who worked to untangle the mess, good on management for allowing this level of transparency and taking ownership, things could have been a lot worse by the sounds of things.


ha the report doesn't note "most of the company was partying in Vegas when it happened"


How is that relevant?


Seriously, where were the -24hr backups that could be rolled back to once it was clear the script was fubar, or the practice of running it on just 10% of the estate first? ...


It is never that simple. Say the backup existed, and was global. By the time you get everyone briefed on how fubar it is and get agreement to load the backups, there are hours of changes from the unaffected customers that will be wiped by the restore, or have to be reconciled by hand for months. Sure, you can concoct the perfect antidote with hindsight, but their retro and next steps are sound.


Build it properly. By definition, if you're following best practice, it is - or always should be - not far from being able to do this. If it's not, someone is to blame.


What it means is that they never tested their disaster recovery system, because this would have been found right away. Or, if they did, someone reported it and an upper-level exec signed off on it being okay to take 14 days to restore a small subset of users.


Again, not that simple. The customer restore procedure was almost certainly tested (and is in active use, as customers blow up their own data often enough). It was _not_ tested on 800 customer stacks simultaneously, as that was considered a sitewide disaster by whoever dreamed up the failure modes to test for. Meanwhile, the actual whole-site disaster restore plan may or may not have been tested, but it was useless for this case, since some customers were unaffected and would have been damaged by the whole-site plan.


Then we are in agreement. Even if it were the case that they had to do 800 separate recoveries (and I vaguely remember reading they couldn't do them individually), it means they never tested a large-scale recovery situation and had no idea that 800 recoveries would take 2 weeks. That's a significant gap they should have tested for.


That lesson really stuck out for me also. My definition of “restore” has been too simplistic.


Not really. What they managed was closer to "undoing" the break, which still makes it look like their disaster recovery might be untested as well.

I'm sure there are many companies who would rather lose 24 hours of tracking to get back online in a few hours than wait 14 days. In the real world, with deadlines and commitments, this is madness.




