NYSE Tuesday opening mayhem traced to a staffer who left a backup system running (bloomberg.com)
278 points by helsinkiandrew on Jan 26, 2023 | 224 comments



It's easy to throw shade when Bloomberg writes an article that puts the blame on "a staffer". Having worked near some of these systems, I can say the engineering and process are actually quite good. How many companies publish their private network topology, service p99.9 in microseconds, and detailed pricing on the open web? They're in a painfully competitive global market that's indifferent to names on buildings.

In a week or so there will be a comprehensive internal post mortem, and every engineer in the company will read it because that's why they work there. "The staffer" will not be named, nor will they be fired. The process will be changed. The systems will be changed. You probably haven't heard of Pillar, but the NYSE in your head was replaced by some pretty amazing, distributed, low latency systems. The culture is to over-engineer, over-provision, plan for black swans. And test. That it works. Test that it scales. Test that backups work. Test, test, test. firmitatis, utilitatis, venustatis. This failure was due to daily testing.

Sometimes things still fail. That's true anywhere. In most places your failures don't make the papers, and accidents are swept under the rug. That doesn't happen at NYSE for obvious reasons. They're not building large language models (that I know of), or self driving cars (pretty sure on this one), but they're a modern, cutting edge, "soft" real-time engineering shop. If you haven't looked already, you might find something interesting there: https://www.ice.com/careers


I worked there and I can say that this is not accurate at all. It is very much a blame culture. I've seen people fired for less severe incidents. Beyond the core technology of the Pillar engine, the place is not comparable to a modern tech company in almost any way.


As somebody who worked with them as a client, I can confirm this. There is currently a spec-level bug with their core Pillar engine and it was essentially bounced between several different teams and ultimately ignored as nobody's problem.


Unlike all of the "modern tech company" problems which are never ignored and only solved when someone's problem goes viral on social media.

They're a big company, some groups are better than others, some customers get more attention than others.


So basically like any other medium to large company? This doesn't sound unique in the slightest.


I would think that the company being a securities exchange would factor into the analysis. Don't you?


How does them being a securities exchange in any way affect the analysis of their software engineering practices? They're not some special snowflake, they can suffer the same software engineering and business process issues as other companies.


>> They're not some special snowflake

But they are. The consequence of a one-day or one-hour shutdown on their system is exponentially worse than most any other. I would expect them to have more rigorous systems, including more rigorous attention to development. Comparing the NYSE to any other business is like calling Fort Knox just like any other bank vault.


I disagreed with you until: "...like calling Fort Knox just like any other bank vault."

Interesting point that teeters on false equivalence. I think AWS or Azure might make for a better analogy. Your point identifies the inherent risk of actually operating a platform business. A bank vault is (mostly) analogous to the cloud in this context: if a vault is robbed or a cloud goes offline, losses extend beyond the business itself, which inherently compounds the severity of downtime.

Linear loss vs. parabolic loss.


But if a cloud goes offline, there is damage to the economy linear to the length and breadth of the outage. Sure, there are losses to businesses serviced by the cloud's users, but they'll bounce back, even if a day-long outage was so severe as to temporarily ground flights and halt supply chains.

If a stock exchange executes trades at incorrect prices, even for a short amount of time, all of a sudden you're in a kind of non-linear sigmoid regime, where investor confidence can suddenly tip into panic selling and recessions can be triggered. Thankfully, that didn't happen here, but it could have. If you're going to give a company that power, you should better hope that they're held to higher standards than most dysfunctional tech organizations!


"If a stock exchange executes trades at incorrect prices, even for a short amount of time, all of a sudden you're in a kind of non-linear sigmoid regime, where investor confidence can suddenly tip into panic selling and recessions can be triggered."

This is false equivalence and slippery slope.


No company or organization is immune to bad business practices.

Them being a securities exchange does not somehow provide immunity from developing rigorous systems which have oversights, or make bureaucracy magically go away.

Likewise, the impact of an outage being more extreme does not mean the people there are infallible. Things slip through. Especially random customer requests being bounced around from team to team, the thing in question.


Then somehow we are in agreement. My impression was that you were saying precisely the opposite: why should we focus on NYSE's bad practices over anyone else's? The answer to that question is that this is a thread about NYSE, under a post about NYSE. And besides, we should, if anything, scrutinize their practices more closely because of the outsize effect that disruptions in their services would have on the global economy.


> like calling Fort Knox just like any other bank vault.

Main difference being that most bank vaults aren't actually empty. ;)


No, they aren't.

There are far more critical snowflakes out there... FAA airspace management, a medical radiation device, avionics in an aircraft, and Facebook.


I used to work for a small startup, and postmortems were truly no blame - engineers would talk about exactly what happened and wouldn't hesitate to put the blame on their mistakes.

But as the company grew, the postmortems became more about blame since now you're not blaming an engineer, but an entire team so singling them out isn't personal. The postmortems were no longer a single engineer describing what happened in his code, but were team leads talking on behalf of teams. They were all about shifting blame from your own team and talking about why a service from another team led to the problem, even if your team could have (and should have) been able to work around it without melting down.

I'm no longer at the company. Postmortems are much more useful when they really are no-blame, because you can get to the real root of the problem, but I don't know if that's possible in a large company.


This happens within big organizations that are large enough that individual units start having that internal small-company feel. I would say a good program, whether it's small or a chunk of a massive org, does a blameless post mortem.

A few years back, a task to modify an index was given to a scrum team. The lead was away and the senior people could not be bothered. The junior developer stack-overflowed an answer, asked for review, tested the script and let it rip. What she missed, if you hadn't noticed, is that the change deleted everything. Every environment, every data center wiped out. 10B records in each prod instance. Lessons were learned and processes fixed. She was not fired, but rather became one of the people safeguarding the keys to our prod kingdom as we fixed our broken process. I stole her away as my first report when I switched groups.


Suddenly I don't feel so bad about deleting an entire PVCS repository (happily answering 'yes' to all the 'are you sure?' questions) at 4:30PM on a Friday.


As organizations become larger they become more political. It's unavoidable.


Having been in the industry for a couple decades, and having worked at both, they're not all that different. Some groups are going to be better than others in the same company. Some companies are floating on venture money today, and might disappear tomorrow. Most technologies constantly cycle. Our experiences working at the same company were different.


Blame cultures and process cultures are both problems in different ways. Blame cultures don't care about individual accountability, only that someone suffers. Process cultures only care that no one suffers, not that individuals are accountable. Both have some misguided notion that something other than personal accountability can lead to good results. Misattributed blame and suffering does not deter poor performance or mistakes. Not even correctly aimed punishments are very good at that. Accountability isn't about punishment, it is about limiting power to the level of responsibility demonstrated. Rules and procedures don't prevent poor performance, they can in fact entrench and guard it, and they only mildly impact mistakes. Best practice can mitigate mistakes to the same extent or better (due to easier adaptability), but people keep trying to turn them into rules, and that has to be fought. If you followed all the rules but didn't get the job done, you still shouldn't be handed the same task again, but not out of blame.


I worked for NYSE’s parent, ICE, and I have to agree with this. While there were many things I didn’t like about working there, the tech and the management weren’t among them. A similar problem to this happened while I was working there, but it was on Endex and not NYSE. Management spoke with the responsible party, but he wasn’t fired and no punitive actions were taken against him. The blame game also wasn’t played. The team just decided to provide more eyes on the process, change the interface of the tools a bit, and move on. The company itself did face hefty fines for the screw-up, though. Ultimately, the issues at ICE/NYSE are due to a highly bureaucratic structure and to onerous regulations forcing parts of that structure to exist. Given those two problems, I think ICE does extremely well.


I've thought about getting into this... The stuff they work on is so incredible to me.

Here's a quote from their Pillar product page:

> Up to a 95% Reduction in Latency: The roundtrip latency on NYSE Pillar order entry sessions via Pillar matching engines has been reduced from ~592μs to ~32μs for FIX and from ~96μs to ~26μs for Binary, getting client orders into the market much faster. With a 92% improvement in the 99th percentile latency results, clients can also have more confidence in improved performance consistency regardless of market conditions.

Reading stuff like this makes my current work feel stupid by comparison.


> Reading stuff like this makes my current work feel stupid by comparison.

It makes our economic system seem stupid. Jesus, we're not calculating astrophysics or quantum mechanics. A made-up system should not require or depend upon this kind of speed or precision. Maybe we should chill.

Reminds me of those pro StarCraft players who keep unnecessarily clicking the mouse to keep their APM (actions per minute) stat high.


Absolutely agree. but "liquid markets are important" or something. rolls eyes

This is just another step in the endless journey of widening the gap between your average Joe and someone with access to high level financial services.


This is what enables products like robinhood to exist - platforms explicitly targeting the average joe investor.


I wonder how they measure this, and whether it's smart engineering from the exchange or just new, fast network gear.


Just an educated guess (I'm in the same industry and have worked on some networking-related stuff), but I think it is probably mostly network hardware and architecture. You can only improve so much from the code; the networking is where all the latency comes from.
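
For illustration, a minimal sketch of the percentile arithmetic itself (the sample values below are made up; real measurement is usually done with hardware timestamps at the wire, not in application code):

    import statistics

    # Round-trip latency samples in microseconds (invented numbers).
    samples_us = [26.1, 27.4, 25.9, 31.2, 26.3, 28.0, 90.5, 26.7, 26.2, 27.1]

    cuts = statistics.quantiles(samples_us, n=100)  # 99 cut points
    p50, p99 = cuts[49], cuts[98]                   # 50th and 99th percentiles
    print(f"p50={p50:.1f}us  p99={p99:.1f}us")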


Just naming a 'staffer' though seems to already be a way to apportion blame to a segment of the employees, insulating management from what was done. Named or not doesn't really matter, clearly blame is being assigned.


I think those are Bloomberg's words, or their paraphrasing of the grapevine. The high level people that I knew there weren't petty, and everyone was of the opinion that it didn't matter who clicked the button: we were all in the same boat.


Yea, it sounds like an issue with process and automation. It shouldn't have been possible for the staffer to make a mistake that would cause this.


Precisely. It is never just one error. At a minimum two and if you really stare at this sort of thing long enough it isn't rare at all to discover a whole chain of them. The only difference with all the times that it went right is that this time everything was aligned 'just so'.


Curious why the link to ICE.COM?


ICE owns NYSE and several other exchanges. ICE stands for Intercontinental Exchange. ICE is the IT administrator for these exchanges. I helped sell some router management software to them a while ago. ICE is fairly new, NYSE used to be independent. That changed sometime after 2008.


do they still do UDP packet loss replay over email?


Serious question: If someone is smart and capable enough to work on tangible things like AI systems or self-driving cars, why should they choose the NYSE outside of pure monetary reasons or affinity for a “modern” tech stack?


Wouldn't working with systems that keep the largest stock exchange for the largest economy in the world running where a simple mistake can cause "mayhem" when the market opens be considered more "tangible" than working in AI or on self-driving cars? It just doesn't have as much street cred as working on those particular projects in the tech community.


Not necessarily. If you're the type that's into finance, then sure, that might get you out of bed in the morning. I'm not into finance and can't stand the culture that surrounds finance. Yes, it's big and touches every single one of us, but that doesn't mean I want to embrace it and go to work in it every day.

If I can take that same skill set and apply it to something with a much better culture surrounding it that affects people in a positive way, then I would definitely choose that over finance any day of the week and twice on Sunday.

At the end of the day, if the NYSE did not exist, the world would continue to turn. It's just not that big of a deal to a heck of a lot of people.


>if the NYSE did not exist, the world would continue to turn

This is startlingly ignorant of the complex machine that is the modern economic system. If something like the NYSE was to shut down today it would be pandemonium.

There is a difference between 'I don't understand how something works' and 'I don't understand how something works, so it is worthless'. The former is healthy and the first step to understanding, the latter is ignorant, and the first step to getting more ignorant.


Relax, Ayn Rand cum True Believer complex.

The current state of business within the current iteration of how people interact with one another isn’t some necessity.

Yes, the world may fall apart for a relatively brief moment in the grand scheme of things — but then life will go on.

The first step to understanding this is to drop the superiority complex.

Very little is actually needed to keep the world turning.


I actually think both you and the guy you are arguing with are half correct.

The real answer here, in my opinion, is that yes there would be pandemonium, and then yes, the world would go on without it, but then something else just like it will pop up. And that is because a liquid market for financial assets (whether that is securities, options on securities, futures, etc) will always be a massive benefit to the ability of businesses to conduct business, and the ability of individuals to preserve and increase wealth.


Exactly. Our current implementations of resource rationing aren't some fundamental of reality, or even something human society needed just 1-2 centuries ago.


All of my “culture” experiences working in finance were uniformly better than in pure tech.

The movie portrayals don’t match my experiences at all and I saw a lot more bad behavior in the tech companies I worked for.

Heck I saw more people working for the intellectual challenge of it in trading than I did in SV style tech firms where money drove nearly every decision.

It’s really hard for me to buy that SV style tech companies are a better place to work when for the last 2 decades the business models that have been front and center are panopticon style tracking to sell ads and legal arbitrage.


Oh, don't get me wrong. I pretty much abhor SV/VC culture too. It's why I don't have one inkling of a notion to work on either coast for the "big" corps.

It's not an either or, I can hate both ;-) I'm a big boy and get to make up my own mind on the matter.


I was remarking on the key word "tangible", not trying to express an opinion one way or the other on financial institutions. Accidentally forgetting to do something and ending up in the news because you caused havoc when the markets opened the next morning is more "tangible" (able to touch things directly) than working on AI or self-driving cars, at least currently. Certainly working in either of those fields might provide more benefits down the line.


Will the world stop turning if people stopped working on self driving cars or AI?


i'm guessing you're trying to make a point here, but care to elaborate on what it is? i think you well know the answer to the question


Serious Answer :

There are problems in Fintech that are absolutely worth solving for altruistic reasons. One that I think is very important and might even need to incorporate AI is this :

Larger financial institutions have access to WAY more and WAY higher quality data surrounding stocks and options. For example, publicly available SEC filings contain extremely useful information about companies. Professional traders have access to services which provide this data accurately in programmatic form (like an API). Us normal people have only the SEC filings themselves, which are enormous documents. It would be impossible to read them fast enough to ever catch up on all of them from, say, the last year. There are free APIs, but they are absolute dogshit and provide incomplete and inaccurate information.

If someone could democratize this and provide this info for free or cheap to the public, it would be an enormous benefit to the general public.
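
For what it's worth, EDGAR already exposes a free JSON endpoint for a company's filing history at data.sec.gov; it's the filing contents themselves that remain painful. A minimal sketch, with field names as I understand them (check the EDGAR documentation before relying on this):

    import json
    import urllib.request

    CIK = "0000320193"  # Apple, zero-padded to 10 digits
    url = f"https://data.sec.gov/submissions/CIK{CIK}.json"
    # The SEC asks for a descriptive User-Agent identifying the caller.
    req = urllib.request.Request(url, headers={"User-Agent": "research-example you@example.com"})

    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)

    recent = data["filings"]["recent"]
    for date, form, accession in list(zip(recent["filingDate"], recent["form"], recent["accessionNumber"]))[:10]:
        print(date, form, accession)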


You're not making it easy to get any answers if you cut out two of the main reasons people in general change jobs.


I consider these things table stakes when choosing a job.


There’s also geography. You go far enough East and you have mostly public sector or defense jobs. A little smattering of insurance data processing. And then fintech.

illinois.edu has one of the top rated CS programs, but once you graduate there are not a lot of options but to move to one of the coasts, or move back/to Chicago and try your luck there. Second City has a good deal of fintech.


>If someone is smart and capable enough to work on tangible things like AI systems or self-driving cars

Maybe they aren't as smart as they think they are? Or they find that there are interesting problems to solve in fintech? Problems they can tackle and see resolved in a realistic time frame vs 'tangible' (?) self-driving cars or chat bots.

I know AI encompasses a far larger range of things but right now, what problems is it solving? Artists, writers, and others can do that work. What do self-driving cars resolve beyond continuing the dominance of car culture in a world that could have better public transit and safer infrastructure?


I'm not recruiting for them, just sharing my experience. I included their careers link for people who might be interested because I know they're always looking for good engineers.


You answered your own question. The only motivation is greed.


Taking a well paying job is greedy? What planet are you living on?


These "issue traced to staffer" stories sound like management cover up for management/system shortcomings to me.

Systems with such significant potential impact, and in industries where a lack of financial investment in their continuity is a deliberate choice, have very little excuse to be passing the buck to grunts for basic process flaws that can be triggered by individual error.


This. If an individual's mistake can take out your business you have a process control problem and that is owned by management.


The days of management taking responsibility for anything are over. See: not a single CEO stepping down for over hiring.


Oh, they take full responsibility, it always says so in the mails they send out. It's just that taking responsibility doesn't appear to actually result in anything happening.


Macroeconomic changes have made it impossible for me to want to pay you

https://news.ycombinator.com/item?id=34515267


A cynical take might be that they are saying that they take responsibility (credit) for reducing the monthly payroll expenses. They may also have overhired in the past, but what's in the past was already paid for. The savings next month is how they justify a large paycheck.


Their punishment is in bearing the shame of having been wrong. That’s the price of leadership.


What shame is there in being wrong? Being wrong is the ideal state, paving a path to gaining an education, which is a source of pride and a benefit.


>See: not a single CEO stepping down for over hiring.

Wait, what? You think a CEO should step down because their management over-hired a relatively small proportion of employees and had to do some layoffs?


Not really sure why you're getting downvoted, other than to assume an emotional reaction from the community to layoffs impacting tech.

Frankly, people seem to be forgetting that until 2013, MS was still doing stack ranking and routinely letting go of the bottom 10% of their workforce (and they were hardly the only ones doing it...)

I don't see it as unusual AT ALL that these companies are doing a wave of cuts to headcounts after the large hiring sprees during covid. Especially as interest rates rise, so they're looking to lower debt burdens in the short term and pay off loans made at low interest rates instead of rolling into a higher interest loan in the new environment.

If anything... I'd expect the exact opposite - a CEO that fails to address cost centers as debt becomes more expensive is a liability, and someone the board might be looking to replace (ask to step down).

---

Does that mean I'm not sympathetic to those who've lost jobs? Of course not.

But tech had to rev the engine pretty hard to handle the extra load during covid, when everyone was indoors and doing things online, and now that demand has dropped, they're letting off the gas pedal.

If folks don't like it - blame the game. Work to unionize. Work to incentivize co-ops and shared ownership. Work to increase taxation on these companies and their highest earners (which... if you're in the tech industry almost certainly includes YOU). Don't go work for giant tech conglomerates and then act surprised when they act like giant tech conglomerates...


I'll make an extreme comparison:

"Kill one man, and you are a murderer. Kill millions of men, and you are a conqueror"

If you make some idiotic financial decision near the bottom of the management tree, such as... over hiring, you'll likely lose your job or get demoted.

Do it as a CEO, and get a huge bonus.

[1] https://en.wikipedia.org/wiki/Jean_Rostand


But it’s absurd. Companies are not supposed to only ever hire.

Some things are cyclical and you need more people for some amount of time, and then you find you need less. It’s not always predictable/seasonal like farming or holiday rush.

Is it wrong for a company to respond to market effects? That there was a layoff isn’t necessarily a sign a company did anything wrong… I think how they actually do the layoff certainly can be done well or poorly.


It’s not hiring though. It’s overhiring.

I’ve forgotten which FAANG it is. But one of them still has more employees than last year even after layoffs. It’s offensive.


Why is it offensive? Over-hiring has been a thing since at least the first dot-com boom. One's managerial power is directly proportional to how many "reports" they have under them. I worked at one company that raised a decent A round. We immediately rented another office down the street, spent close to 2 million on renovations, then filled it with anyone who could spell HTML. The B round was even larger, so the cycle continued (until late 2001 or so.)


It's a response to extreme demand during covid. When - you know - online service usage was at all time highs because everyone was stuck inside and doing things online.

It was likely the right call to hire then, just like it might be the right call to reduce headcount now.


So if they "under hire", should they step down for that too?

Maybe they should step down any time they fail to accurately predict the future?


Offensive? I'm... honestly, baffled. How could one tech company's ability to hire many more people actually offend you?


It's not relatively small. All the companies are experiencing similar chaos to the NYSE because people in the middle of important operational work suddenly vanished. The people laid off weren't idle like H&R Block tax preparers in May or Target clerks in January.

The people laid off and the people not needed were a different set of people, at the time of the layoff.


This is because the CEO's core job is to raise the stock price. Nothing else. They hired during covid and profits & share price spiked due to the economic state at the time. Now the economic state has changed, so they fire employees and the stock goes up. By that metric, the CEO will get a bonus at the end of the year. The CEO does not get a bonus for not laying people off. Employees are not humans once you get to the C-suite. An employee can be a person, but multiple employees are just numbers on a ledger. They just send out "I'm sorry" emails to placate the masses and to get good media; no one really cares if the lower level people are upset. You only count once you get to a certain level.


> The days of management taking responsibility for anything are over. See: not a single CEO stepping down for over hiring

The list of managers stating that "they were taking responsibility" and then immediately stepping down was always fairly short.


They only take responsibility for the profit margins. Overhiring affects those, but often not significantly enough, and it can be corrected with layoffs.


No, they only take responsibility for short-term market cap. Margins and profit don't matter. That's why they chase whatever fad hits the investor class.


They are taking responsibility. They are just delegating the consequences to their staff. I suspect this will change soon. Activist investors are already surrounding companies like Salesforce and I can see CEOs being promoted sideways (board member only).


I don't see why we reward scale-out/scale-in in the cloud but punish CEOs when they do the same with real people /s


How will those poor decommissioned computers get enough bytes to feed themselves?


There are plenty of companies replacing their CEOs. Just today Toyota announced theirs.


Toyota's CEO is becoming the chairman of their board; that doesn't feel like a CEO being replaced as punishment for poor performance in the way that people are talking about in this thread. But even when CEOs are fully ousted over issues, the golden parachute makes it barely feel like a punishment anyway. I'm having trouble thinking of a case where a CEO actually seemed to be significantly financially impacted by such an event, though maybe FTX will provide an example shortly.


Are the golden parachutes bigger (as % of annual comp) than employee severance packages?


Do you think that honestly matters in the 10s of millions of dollars range? I certainly don't. The problematic parachutes in question are beyond enough for an excessively wealthy standard of living for the rest of their natural lives, even if it's proportionally smaller. Whether or not CEO comp should be as high as it currently is is another question entirely.


If you have processes where there is nothing an employee can do to affect the outcome of the company you successfully built a legacy bureaucracy that is waiting to be disrupted.


In this specific case, I don't think that's necessarily the outcome. Our industry has yet to adopt a universally acknowledged equivalent of a lockout/tagout (LOTO) interlock. There is no need for a bureaucracy if we have cryptographically enforced, multisig Shamir-secret-sharing keys where a LOTO prevents (in this case) a system from spinning up while another system (apparently the backup system here) is running. Allow it to be overridden by a sufficiently senior manager, or by a sufficient number of lower-seniority managers, which leaves an audit trail. Integrate it with change management, notification, and secrets storage infrastructures, plus infrastructure as code, and it encodes these infrastructure dependencies into code and can be queried to auto-construct change interlock sequences for a particular desired state.
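
A toy sketch of the k-of-n override-key idea, using Shamir secret sharing over a prime field (illustrative only, not production crypto):

    import secrets

    PRIME = 2**127 - 1  # a Mersenne prime; all arithmetic is done mod this

    def split_secret(secret, n, k):
        """Split an integer secret (< PRIME) into n shares; any k of them recover it."""
        coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]
        def poly(x):
            acc = 0
            for c in reversed(coeffs):       # Horner evaluation mod PRIME
                acc = (acc * x + c) % PRIME
            return acc
        return [(x, poly(x)) for x in range(1, n + 1)]

    def reconstruct(shares):
        """Lagrange-interpolate the polynomial at x=0 to recover the secret."""
        total = 0
        for xi, yi in shares:
            num, den = 1, 1
            for xj, _ in shares:
                if xj != xi:
                    num = (num * -xj) % PRIME
                    den = (den * (xi - xj)) % PRIME
            total = (total + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
        return total

    override_key = secrets.randbelow(PRIME)
    shares = split_secret(override_key, n=5, k=3)   # e.g. five managers hold shares
    assert reconstruct(shares[:3]) == override_key  # any three can unlock the override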

Of course, once you take advantage of such a representation at scale by deploying tremendously more complex infrastructures, you then have to deal with the dependency-network meta-challenge, lest you inadvertently fall into dependency hell. While NP-hard problems lie down that road, they're still computable to a reasonable degree, and I dare say it's a more robust situation than doing it all by hand like we do today.

The real challenge is the vast majority of devops staff today would really dislike reasoning about such a representation when it blows up in their faces, and I can't blame them for that kind of reaction.


It's very easy to talk about completely automated systems and LOTO, and you need these when you have under-skilled staff. The NYSE likely does NOT have undertrained staff. If you have LOTO systems and the like, what do you do when a sensor fails and you can't figure out why your check for whether the other system is running incorrectly thinks it is? Do you allow the stock market to simply not open?

What if multiple sensors fail or it's an ambiguous situation like say you are deciding whether or not to fail over a power circuit and it's a brownout but not a complete power failure? What if there is a systemic problem and it's likely the backup power source is going to brown out too? At some point you need highly skilled individuals, like say trained airline pilots flying a plane who have the authority to override systems immediately without having to jump through hoops.

This is especially true for mission critical systems. Many of the mission critical systems we rely on are NOT built on the cloud, i.e. other people's computers because you want to be really careful about what hardware you are using, precisely how your data center is setup and want to make sure things like a noisy neighbor do not impact you.

Like it or not, these highly trained individuals are going to make mistakes every now and then. A failure like this once every decade or so really isn't so bad. The individual who made this error is likely not a "grunt". I suspect the individual in question will not necessarily suffer any major consequences as a result of this unless it wasn't a mistake but a flagrant disregard for the rules like say bringing a bottle of water into a data center that then spilled or something.

Have you built a mission critical, distributed system that hasn't failed for 10 years? It's a lot harder than it looks. That's how often the NYSE has a problem like this, about once a decade. A lot of things that work in theory, don't work for the edge cases and things that lead to problems once a decade or so are extreme edge cases.

In the grand scheme of things a mucked-up opening auction is a minor problem, and anyone who did not take the precaution of sending a limit order and sent a market-on-open order, despite it being standard practice to essentially always use limits, and got hurt badly will be made whole.


> If you have LOTO systems etc, what do you do when a sensor fails...

It pretty much boils down to: it depends upon what the business wants to prioritize, operating margin or resiliency. There is an entire subfield investigating the statistical foundations of resiliency, and the general case of N-modular redundancy is in practice implemented as triple modular redundancy in most commercial systems that want to spend in this direction.
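
A toy illustration of the triple-modular-redundancy idea: three independent replicas report, a majority wins, and disagreement escalates to a human instead of silently picking one:

    from collections import Counter

    def tmr_vote(a, b, c):
        value, count = Counter([a, b, c]).most_common(1)[0]
        if count >= 2:
            return value                # at least two replicas agree
        raise RuntimeError("all three replicas disagree: escalate to an operator")

    assert tmr_vote("backup-down", "backup-down", "backup-up") == "backup-down"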

> Like it or not, these highly trained individuals are going to make mistakes every now and then.

Absolutely, and here is where the organization's no-blame learning culture swings into action for the well-led teams.

> It's a lot harder than it looks.

We all know this, and we can all help each other get better to deliver ever increasing value to our customers by sharing what works for the context we deployed within!


You don't get to major failures once a decade (or less) on systems this complex without understanding and in fact being on the cutting edge (likely ahead of what you read in journal articles written by academics) of the statistical foundations of resiliency, n-modular redundancy etc.

In real-life outside of a journal article, it's a lot harder than just deciding whether you want to prioritize operating margin or resiliency at 5000 feet.

In real life when these sorts of edge cases happen, you have to understand in minutes or sometimes seconds the tradeoffs, in terms of costs to your own company and your customers, of one of n specific possible failure modes, and risk-manage so you minimize the probability of the catastrophic outcomes. This sometimes may involve increasing the probability of low-cost bad outcomes. You can't reason about this stuff beforehand. If you could, you would have designed your system to not fail in that manner.


>If you have processes where there is nothing an employee can do to affect the outcome of the company you successfully built a legacy bureaucracy that is waiting to be disrupted.

Exactly.

I wonder if any of the people claiming "it's management's process fault!" would be the first to complain about their workplace where they have no autonomy.


On the contrary, if a single employee can take out the whole business, you are guaranteeing disruption.

There are many kinds of "outcomes". A simple backup would make outages far more rare.


Everyone makes mistakes. HOWEVER, modern leaders get to the heads of large organizations by never making mistakes, by blaming the little guy or a coworker, and by hustling up. They'll only fix this because they have to. There isn't a problem until it happens.


Exactly. Management shouldn't allow this kind of situation to occur; they should design the company's processes such that there are checks and balances.


> individual's mistake can take out your business

It didn’t.


They're not mutually exclusive. A staffer leaving a backup system running may well have been the proximate cause of the issue but, if true, it was also likely a management/system issue as you say. The article is a bit strange in that it doesn't attribute the fact in the headline to any source. I don't see anything from the NYSE saying "it's all that guy's fault". On the contrary it says:

> [NYSE execs] plan to examine the platform’s procedures and management, potentially reworking rules to be more flexible and provide further protections.

Sounds like they know it's a management issue. The headline probably focuses on the staffer leaving the backup system running simply because it's a better headline.


yes ... if an organisation has a critical process where a single human making a mistake can cost it millions, then it has a management/process-level issue, not an issue with the human. Humans make mistakes. Apart from other obvious issues with it, creating a context where individual mistakes lead to horrific outcomes will create a toxic and horrifically stressful workplace - I would actively avoid working in a situation like that myself.


> These "issue traced to staffer" stories sound like management cover up for management/system shortcomings to me.

At some point you need to strike a balance between freedom/flexibility and stupid proofing.

HN goes real hard on the "people are idiots and we should design things that no matter what buttons get mashed it all works out fine" side of things but in the financial world the balance is struck a little further on the "train our employees to not be idiots" side of things.

Furthermore, it's usually better optics to blame things on people, because people can easily and cheaply alter their behavior (per incremental change). If you blame the outage on systems, it raises questions of when it will be fixed and how much $$.

As an aside, it was almost certainly not individual error. At places like NYSE you pretty much always have 2-3 people who should be in a position to catch a mistake like this.


> it was almost certainly not individual error. At places like NYSE you pretty much always have 2-3 people who should be in a position to catch a mistake like this.

That's exactly the point that is being made here. Either the message being put out by the NYSE claiming this was an error by one individual is true -- in which case, NYSE leadership is to blame for setting up a process that allows catastrophic consequences for a single individual's error, OR the message being put out by the NYSE is a fabrication designed to redirect blame at some scapegoat, in which case NYSE leadership is to blame for putting out a false or misleading statement.

[Edit: It seems I misunderstood -- attributing this to an individual was done by reporters and rumors, not by a formal statement from NYSE.]


It does not sound like cover up to me.

It was simply the explanation of what happened. I didn't get any hint that the "staffer" in question will be fired or otherwise punished.

Is there a problem with the system, in that it did not have enough safeguards to prevent this from happening? For sure, but then no system is perfect. This glitch does not happen every day. From memory, I remember a NASDAQ glitch at Facebook's IPO. Let's say there are 2 or 3 glitches like that for major exchanges in one decade. How can you design a system that prevents bugs that show up once a decade?


We have a Slack channel where we are expected to announce all of our production changes.

Some updates are highly regimented, but a couple of the more operational teams have discretion to deploy things outside of that process, and most teams can flip feature toggles whenever they want.

Point is that sometimes people will comment, or even veto changes. We have a major customer visiting today, or the sales team is at a conference. Don’t touch anything or you might break something.


I had a project that was really important once. I made a tiny mistake that had a big consequence - a couple of hours of potential lost revenues from our customers. I fixed my mistake with both my boss and CEO nearby. I said after I pushed the fix that I really need more resources around it. That little light of "yeah, this is important" that should have flickered didn't. :)

I will not be surprised if nothing gets fixed with the issue at NYSE.


If you were to ask what is the probability that this specific error happens again, I would think it would be pretty low. Probably lower than a week ago. If you were to ask what is the probability that some significant, costly error happens again, I don't think the probability is that much lower than a week ago.


But how much actual lost customer revenue? Also, did the customer even notice or not?

You're reminding me of the difference between engineers and non-technical managers; to many of the latter something's only a problem if/when the customer or senior mgmt are on the phone complaining about it. Until then it's all naysaying engineers being too pessimistic about process and risk.


No one gave me a figure, but the product going down does have a direct impact on customer revenues. So, while no one actually gave me direct numbers, it was felt quite a bit. The most notable thing from the incident was that certain account managers in Japan had to do formal mea culpas because of this mistake. So, in other words, they were on calls, but I got back just "you can't have this happen again - do something." Could someone else at least do code review? I was depressed that I got a lot of "well, we don't know the code base" - the issue also got lost in the shuffle with management. So, I just grew more pessimistic about process and risk. It's not healthy and I would not bottle it up now, but I did, and that was my mistake. I ended up hating that project, but at least I got to train someone else to work on it (who has a team around him).


We can always blame management; since they make decisions, including hiring decisions, we can always trace problems back to management. But it is as unhelpful as blaming the grunts. Management has the job of making the company profitable; if they don't, employees won't get paid, investors will lose money, and ultimately the company will fail and customers won't get service. And just like the "grunts", they are not perfect: sometimes they make mistakes, sometimes they have to take chances.

In fact, blaming anyone is unhelpful unless blatant misconduct is the problem, and I don't think that's the case here. As always, shared responsibilities. I just wish for different wording, something like "NYSE Tuesday opening mayhem traced to a backup system not properly shut down". Leave the "staffer" part to the technical report. It is useful information for investigating the problem and fixing what needs to be fixed, but it is inconsiderate for a press release.


The difference is that one asserts authority over the other.


> These "issue traced to staffer" stories sound like management cover up for management/system shortcomings to me.

At the end of the day, the engineers are responsible for the engineering. Managers are responsible for managing. Shifting all responsibility for execution issues on to management can give warm fuzzies, but in reality managers aren’t all powerful in shaping execution by engineers.

Companies that put all blame on managers when things fail are inevitably encumbered with excessive micromanagement, as the managers are effectively saddled with responsibility for execution as well.

The article was purely anonymous. I don’t think it’s fair to assume they’re jumping to blame or fire individual engineers.


Do engineers have authority over engineering? Can they overrule management on engineering issues? Whoever takes the authority gets the blame.


Look it's totally OK to recognize that a human action was the trigger for an incident - i.e. the causal chain for this specific incident started there. That's not the same thing as saying the human action was the root cause, and I hope by-and-large any kind of baseline competent engineering organization has gotten to that level of thinking by now.


> These "issue traced to staffer" stories sound like management cover up for management/system shortcomings to me.

If you're going to move the blame up the food chain, might as well blame the shareholders for giving the company money and choosing to keep the upper management in place.


It's also true that in systems like this there exist many single points of failure. There's a reason decentralized systems are seeing a rebirth.


Sounds like many places I’ve worked. I think most devs have had a job like that.


Seems so weird to not have automated checks for something that seems to be described as "someone left the light on" and also not have the exchange automatically initiate itself. However, it still isn't that clear what the problem was. Were prices not "real" or correct?

Stuff like this will happen more and more. We treat software driven systems rather recklessly.


Matt Levine at Bloomberg gives a good explanation: https://www.bloomberg.com/opinion/articles/2023-01-25/nyse-f...

Basically, at the market open all the requests to buy and sell get matched at the same "open" auction price, then (a second later) the orders get sent to the order book, where the price can go up and down based on size. Because the system didn't think there was an opening, there was no opening auction; the orders went straight to the book and there were large swings in price.
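
To make the mechanism concrete, here is a toy uncross that picks the single price maximizing matched volume across accumulated limit orders (real opening auctions also handle market orders, tie-breaks and imbalances):

    def uncross(bids, asks):
        """bids/asks are lists of (limit_price, size); returns (open_price, matched_volume)."""
        prices = sorted({p for p, _ in bids + asks})
        best_price, best_volume = None, 0
        for p in prices:
            demand = sum(size for price, size in bids if price >= p)  # buyers willing to pay >= p
            supply = sum(size for price, size in asks if price <= p)  # sellers willing to accept <= p
            if min(demand, supply) > best_volume:
                best_price, best_volume = p, min(demand, supply)
        return best_price, best_volume

    print(uncross(bids=[(100.5, 200), (100.0, 300)], asks=[(99.8, 250), (100.2, 150)]))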



More automation -> more code -> more things to go wrong


The great thing about automation is the breadth, depth and speed at which I can propagate mistakes.


Some submarine operations are deliberately not automated, because if eg a sensor is broken or miscalibrated it could sink the entire boat if a computer very rapidly acts on the wrong information. Rather they do those operations with one person operating the valve/machine/device/etc, another watching and confirming readings, and the whole thing is on a constant audio link with a third person in an engineering room who watches the readings through a centralized system.

It's clearly not optimized for efficient use of personnel, but the personnel complement will have been designed to provide sufficient people at all times and the cost of getting it wrong can be very large indeed.


I was working on the repo desk of a large Japanese bank in New York in the 90’s. There was a big (both in font size and magnitude) number on the upper left of the blotter system that ran on every trader's desk, which represented the total we had to borrow that day to fund the bank's trading book. There would be a number below it which represented how much we had borrowed so far.

It was “too important to automate” so a trading assistant keyed it in every morning. One morning he typed the wrong number and the mistake was in the billions digit.

At 2:45 “the cage” called the repo desk and said “You know you guys are still short a billion, right?”

There was then a flurry of activity as traders got on the phone to try to borrow a billion dollars in fifteen minutes, while also trying not to let on that we were kind of over a barrel. The head of fixed income prepared his explanation to the Fed about why we needed to borrow a few hundred million overnight.

The number got automated in our next release, and the open procedure was changed to the trading assistant verifying the number against the “cage” report.
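
That change is essentially an automated reconciliation; a hypothetical sketch (the numbers and tolerance below are invented):

    def reconcile(keyed_target, cage_report_total, tolerance=1_000_000):
        """Compare the keyed funding number against the cage report; halt on a material mismatch."""
        diff = abs(keyed_target - cage_report_total)
        if diff > tolerance:
            raise ValueError(f"funding mismatch of ${diff:,} exceeds tolerance; halt and verify")
        return diff

    reconcile(4_000_000_000, 4_000_000_000)    # fine
    # reconcile(5_000_000_000, 4_000_000_000)  # would raise on the $1,000,000,000 fat-finger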


> It's clearly not optimized for efficient use of personnel

I disagree. When the sub sinks, all these people die. That's far more inefficient use of personnel.

It feels like you're putting more emphasis on the material cost ("the cost ... can be very large indeed") than on what actually matters.


It feels like a wartime-vs-peacetime priority problem.

In wartime you would care about efficiency: build the largest number of subs staffed with the minimum crew. And since training crew quickly becomes the bottleneck, you would probably go for the highest degree of automation that doesn't impact production times too much. In peacetime, efficiency isn't as important. What is important is the bad PR of losing one of your submarines in a training exercise or on patrol, so crew safety becomes a much bigger concern.

Losing those sailors in war would have been a noble sacrifice for the cause, losing them to the exact same accident in peacetime is a national tragedy.


Losing a ship (boat?) during peacetime is a PR disaster.

Losing one during wartime can cost you the war.

An inefficient one that probably won’t sink due to combat damage can stay in the fight long enough to matter.


Submarines are referred to as boats rather than ships due to naval tradition.

As a former naval officer I can only say that technically losing any vessel could be the one that loses a war, just as any soldier lost could be the straw that breaks the camel's back. But if your navy is so rickety that the loss of a single vessel is enough to lose then the main deficiency was in planning rather than any specific warship loss. Losing one in peacetime should never happen but is not unheard of even in modern times. See eg the Kursk or the Fitzgerald.


It’s also not unheard of for a single vessel to have outsized effects on an entire war https://amp.theguardian.com/world/2017/oct/20/enigma-code-u-...

It could also be argued that the Romans capturing a single Carthaginian warship turned the tide for their entire empire.


This is now my headcanon explanation for why starships in Star Trek still require large crews (or crews in general).


On the one hand, given how often sensors fail and AIs flip the evil bit in Star Trek, limiting automation is probably a good idea.

On the other hand, the Enterprise won't even warn anyone when command staff are injured, cloned, mind-controlled or vanish from the ship altogether unless a human asks the computer where a specific person is first.

Of course there are Doylist reasons for all of this but I do like the premise of a general fear of AI and possible weird space BS being a factor.


> This is now my headcanon explanation for why starships in Star Trek still require large crews (or crews in general)

The canon explanation is that automation-in-charge was experimented with and went really badly, though periodically they try something approaching it again.

https://memory-alpha.fandom.com/wiki/The_Ultimate_Computer_(...

(AI, human genetic engineering, and a number of other areas of technology are affected by variants of this issue in the Trek canon.)


At least in ST:VOY, it has been shown that a single hologram is capable of running the ship - although one might argue "exceptional circumstances" ;)


In Star Trek III they jury-rig the Enterprise to fly with a crew of 5, instead of the regular crew of 400.

Though in that state it can't do much more than fly: combat capabilities are strongly diminished, maintenance doesn't happen, post-combat repairs are out of the question, and science missions would be much harder. On occasion the Enterprise has transported 150 passengers, so I imagine there's a lot of kitchen staff, security, etc. You only need 5 people to fly the ship, maybe 40 to fly sustainably with maintenance, but to actually accomplish their regular mission you need the other 300 people.


I will argue that the corporate world is the exact opposite of the training and discipline of a submarine crew. Most of the time I wonder how the businesses even survive the chaos and mismanagement, let alone make money.


Knight Capital will forever haunt fintech engineers... https://www.henricodolfing.com/2019/06/project-failure-case-...


I've not seen it put so eloquently.


To err is human, to really foul things up requires a computer

- William E. Vaughan https://quoteinvestigator.com/2010/12/07/foul-computer/


“A computer lets you make more mistakes faster than any other invention, with the possible exceptions of handguns and tequila.”—Mitch Ratcliffe


This employee left the backup system running. There's obviously some automation, but what is the solution?

Process changes that people have to remember, or more systems to prevent the issue? So I don't get your statement as it relates to this article.


I don't imply that there is a solution.

We will simply create a second system (B) to monitor the first system (A). Now we have two systems to maintain. System B will not be capable of steering A by itself. So we still need to know how to diagnose and repair A, and we also need to know about B too. Maybe system B can talk to a Prometheus/Grafana stack (if it's up). And that can put alerts into Slack (which we ignore because there's always alerts in Slack). And after standup we can take turns looking at graphs with consternation.

> Stuff like this will happen more and more. We treat software driven systems rather recklessly.

That sentence is where I go when I hear the word 'automation'.


>There's obviously some automation but what is the solution?

Shutdown and wake-up time in bios of server and switch ;)


>More automation -> more code -> more things to go wrong

More people -> more entropy -> much more things to go wrong


One person might be inattentive or drunk, but it's less likely that two people are. So you institute a two-person rule. And if that's not good enough, add a third person to double-check. Maybe a supervisor to observe the people doing all of the above, to catch any mistakes or negligent behavior. Also have them write down the steps they have taken, and have somebody else read through that to verify. Just keep adding people until you are satisfied with your odds (and hope you are not making it worse through second-order effects).
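
The back-of-the-envelope version: if each checker independently misses a mistake with probability p, then n checkers all miss it with probability p^n; the catch (those second-order effects) is that real reviewers are rarely independent. A toy sketch:

    # Toy model only: assumes reviewers fail independently, which they usually don't.
    def miss_probability(p, n):
        return p ** n

    for n in range(1, 5):
        print(f"{n} reviewer(s): {miss_probability(0.1, n):.4f}")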


So you wanna say two people are more reliable than an automated system? That's totally wrong....


Sounds like you've never experienced a catastrophic failure due to an automation that didn't work right.

I did last year, and my company is in the process of de-automating certain processes that can endanger that company if they go wrong.

There are many things in tech that are too important to automate.

I'd even posit that the more experience you have in tech, the more you've seen how things go wrong, and the more you realize that automation is a tool for humans to use, not a replacement for humans doing a task.


>Sounds like you've never experienced a catastrophic failure due to an automation that didn't work right.

Much, much, much more due to human error... but hey, maybe you are the worst programmer ever, but even then I would say your programs are more reliable than a human.


I don't think it's that they "left it running" like you would leave a backup app running... they literally left the entire disaster-recovery site up and running and live. Cermak (referred to in the article as the "backup") is an entire datacenter, hosting a running copy of the exchange to be used in a failover scenario.

You'd have to have more than 1 person involved to forget that DR is still active when completing these failover exercises and tests off-hours.
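
One hypothetical mitigation (the endpoint and field names below are made up) is a preflight check that refuses to start the primary's opening sequence while the DR site still reports itself live:

    import json
    import sys
    import urllib.request

    DR_HEALTH_URL = "http://dr-site.internal/health"  # hypothetical endpoint

    def dr_is_live(url=DR_HEALTH_URL, timeout=2.0):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return json.load(resp).get("state") == "active"
        except OSError:
            # An unreachable DR site is treated as "not live" here; a real system
            # would want positive confirmation rather than inferring from silence.
            return False

    if dr_is_live():
        sys.exit("DR site still active: aborting primary startup, page an operator")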


> Cermak (referred to in the article as the "backup") is an entire datacenter

Well, they are a tenant at the Cermak data center. It’s a truly massive building with huge amounts of connectivity and colo opportunities. Probably also the only 100+ year old data center building on the US register of historic places, lol (it’s a former catalog printing facility, built to hold insanely heavy printing presses on 8 or 9 really tall floors, so it has no problems with densely-packed server racks)


Yeah my point is only that it's a full DR site, not a "backup" that was left running as the article pointed out (and as a lot of commenters are insinuating).


Ah, I see. Some quick internet searching shows that from time to time (rarely), NYSE operates the Cermak site as the primary site for at least part of their operations - for up to a week at a time.

It seems to have multiple purposes - DR, customer software testing, etc.


Well, you do that to make sure it works. At one of my previous jobs, we'd switch the primary and secondary datacenters every six months just to make sure we could.


The auction is meant to find a stable price and can have some wild prices coming in; because no matching happens at the point of entry, the market will naturally find a level before trading commences. In this case, those wild prices were matching, resulting in crazy trades no-one would have expected.

You'd be surprised how many manual processes there are in places like this. It's a combination of legacy systems / processes, and a general paranoia around automation going wrong. I wouldn't be surprised if they always have someone there to shepherd the system along.


When I worked on a trading platform, I spent many a happy Sunday night waiting for the Australian market to open and watching the first orders go through successfully.

We had hundreds of jobs and upgrades happening over each weekend. It definitely needed an eye cast over it regardless of the automation.


I am working at one now and nothing has changed. We have so many automations that we recruit an army of people just to check whether they ran, while knowing how to replace them if they didn't (or, more accurately, who to call at night to fix it ASAP).

And the AU orders going through is a good sign, but it's far from guaranteeing a free Monday, as Japan, Korea or Shanghai can fuck it up, each in their own little ways. Hong Kong is the best: low regulatory crap, an invested regulator, high-volume low-latency traffic every day (relative to the region). I can't recall a time it broke.

Once, someone fat fingered an excel import at close, and we lost our trading license for that entire country for 18 months. And we're not small. But the amount mismatched at settlement was super tiny. High attack surface, low holistic understanding (it works despite us, we honestly have no clue sometimes), heavy consequences on screwup.


Is it normal to simply accept the word of an anonymous source for something so important? I genuinely don't know anymore, but it doesn't seem like a good idea. I'd rather wait for a more thorough investigation. Especially when the story from these sources boils down to "Kevin was in charge of booting the NYSE App that morning, but he was late for work. He had a good excuse, though, he flaked! We'll have the chap straight up for lunch, no question".

Edit: I also note that this piece is lacking the traditional "The NYSE did not respond to a request for comment".


>Is it normal to simply accept the word of an anonymous source for something so important?

Anonymous means they aren't revealing the source, not that Bloomberg doesn't know who the source is, or what they do.


I know that. I am referring to you and me, the readers, accepting the word of the anonymous source. Combined with the fact that they apparently did not ask the NYSE for comment before publishing this. Or if they did, they neglected to mention it.


We aren’t accepting the word of the anonymous source. We’re accepting Bloomberg’s word that the source is reliable.


Some of us aren't "accepting" anything. We're just reading about a potential cause of an incident and speculating about how it could happen to us or could have been prevented. Just because we're reading this article and commenting here doesn't mean we just believe everything that we read. The post-mortem will come out soon enough and we'll read that and comment again.



I don't know how Bloomberg works, but the New York Times has a very clear and public policy about using anonymous sources.

There's usually a link to it in the middle or end of any story it publishes using an anonymous source.

The Times isn't Bloomberg, but it might give you some insight into how these things work.


A lot of stocks show wide-range minute bars in the first few minutes of continuous trading. This looks similar to the incident that caused the Knight Capital fiasco, in which the system repeatedly bought on the ask and sold on the bid very fast, pushing the high price higher and the low price lower. At the open, market makers are usually more wary of risk they cannot hedge directly and thus less willing to take on positions, leading to wild swings.

Still, this report (and the previous statement) does not give enough detail on why a backup-system misoperation resulted in this. Also, critical large systems like exchanges rarely have a single point of failure. Usually there is a sequence of issues along the event chain leading to something like this, so one "failed to properly shut down" causing all of it is a bit hard to believe. We will need more explanation.


The Knight Capital issue was a test flag in the code that caused orders to multiply.

From what I'm reading, this NYSE error seems a bit more complex: the presence of a running backup system confused the current market state, causing the opening auction to be skipped.


Doesn't each trade require both a buyer and a seller, who buy and sell at an agreed on price? Presumably both parties would be satisfied with the trade so it isn't clear to me what all the fuss is about.


If I had put in a market buy/sell on open order, I'm accepting the market price, but expecting the market price to be set by an opening auction. I don't know if a market on open order would have been cancelled or just executed shortly after the bell; you could argue for both treatments and it usually doesn't come up, so it might not be mentioned in retail brokerage documentation.

Personally, I always do limit orders, but I would consider market on open/close as reasonable options. But I don't think this is typical; a lot of orders are market orders against whatever limit order is at the top of the book. Normally that's OK, but it gets weird when things get weird, as seen here.


I don't know what happened in the current case. In the Knight Capital case, KC clearly didn't intend to send those erroneous orders. And if those trades were not annulled, KC would not be able to settle them, since the trading losses were larger than the collateral KC put up.


The trades were not annulled, because NYSE ruled them not "clearly erroneous". Which is why it was an existential mistake, not just an embarrassing one.


I haven't really researched anything, but both parties thought they were bidding/offering into an auction with time to cancel or amend their orders.


Market orders are executed at whatever the current price is. Presumably it's those trades that caused the havoc.


Oh we're still doing scapegoats?

If your system can be hosed by a single person the system is at fault. Start with the scapegoat's manager.


If the trades are being cancelled, are they going to correct the chart data? Right now it looks very misleading on the daily[0], weekly, monthly, quarterly, yearly etc for large caps that trade quite steadily otherwise. I do understand that this would be a challenging effort as that data already flowed to and was stored by all the broker-dealers, but I think it should be done.

[0] https://www.dropbox.com/s/6jdmgkdyei9xqz0/mcd.png?dl=0


Technically yes, historical market data feeds need to be cleaned up, which will be a nightmare for every single person who maintains one...

Which is also why exchanges are very reluctant to mass-cancel trades. The knock-on effect goes beyond just market data feeds.


The same feed that publishes trades also publishes trade busts, so it's up to whoever's consuming it downstream to take care of.
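
To make that concrete, here is a toy sketch of what honoring busts looks like on the consuming side (the message shape and field names are made up; real consolidated feeds have their own formats):

    # Hypothetical downstream consumer keeping a trade history that honors busts.
    trades = {}  # trade_id -> trade message

    def on_message(msg):
        if msg["type"] == "TRADE":
            trades[msg["trade_id"]] = msg
        elif msg["type"] == "TRADE_BUST":
            # The bust references the original trade: drop it and rebuild
            # anything derived from it (bars, VWAP, last price).
            trades.pop(msg["trade_id"], None)

    on_message({"type": "TRADE", "trade_id": 1, "symbol": "MCD", "price": 185.0, "size": 100})
    on_message({"type": "TRADE_BUST", "trade_id": 1})
    print(len(trades))  # 0 -- charts rebuilt from this history no longer show the busted print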


Successes are management's, failures are individual's :)


True. But when that happens, that's a textbook sign of lack of leadership.


Tuesday news blackout; Thursday "it was all Jim's fault"... Right.

Smells like horse shit. Most of what comes out of the profession of "journalism" does too, lately; but this smells strongly.


Why couldn’t have been Jim’s fault? The world is run by Jims.


As others have said, if Jim's fuckup can have such large consequences, then there should have been backstops for Jim who would have shared the blame. Nothing against Jim, it's just he's being used as a distraction to avoid talking about the real issues.


Manager: "I hope nobody asks why Jim was allowed to have so much power with no oversight or validation"


And all the trades are basically being reversed. The whole thing stinks.


Not all of them... only the most egregious.


I am short a couple dozen naked Feb 3 calls on a lot of affected tickers and almost had a heart attack when looking at one of them on my phone. Thankfully I was not in front of my computer at the time because I have no idea how my broker was managing my margin at open.


The LULD breakers saved everyone from more chaos here because effectively as soon as the market opened, all these symbols were halted.
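
For anyone unfamiliar: LULD (limit up-limit down) halts trading when prices move outside a band around a rolling reference price. The real rules are more involved (band percentages vary by tier and price, bands are widened near the open and close, and there's a limit state before the actual pause), so treat the 5% below as purely illustrative:

    def luld_band(reference_price, pct=0.05):
        # pct is illustrative only; real LULD bands vary by tier, price, and time of day.
        return reference_price * (1 - pct), reference_price * (1 + pct)

    def should_halt(trade_price, reference_price, pct=0.05):
        lower, upper = luld_band(reference_price, pct)
        return trade_price < lower or trade_price > upper

    # A blue chip "opening" 25% below the prior close trips the band immediately.
    print(should_halt(trade_price=75.0, reference_price=100.0))  # True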


A backup system connected to prod: that somehow reminds me of the Knight Trading debacle. Someone there apparently connected some test code to their prod and they blew up the company in under an hour.


I mean, isn't that how all DR is essentially configured? You need it "somehow" connected to prod depending on the failover, config, system, etc. And in many of these complex systems DR can be per subsystem, not an "all or nothing" approach?


Galileo's Principle bites hard, great explanation here: https://99percentinvisible.org/episode/cautionary-tales/tran...


There is so much stupidity in the process they describe that I have no faith it will be fixed. A manual daily DR test that clearly wasn't followed by a checklist or double-checked by another person, and leaving DR up broke prod?? Literally none of those things should have happened.

I know the world is held together with duct tape, but it's embarrassing when you see the tape fall off.


The way their DR is setup is that clients of NYSE (brokerages, OTC systems, firms, banks) all have IP (not dns) connections to the primary NYSE production datacenter and a full second set of IPs for the DR site. It's not a "dns and load balancers" setup where the service itself can just route the traffic somewhere else. The clients themselves determine where to connect to consume trade data and execute trades. There is likely some modus operandi given to clients on how to connect to primary and DR sites based on some specific logic.

The NYSE DR guide [1] says that if DR is active, production is not. It's not a distant reach to consider that some of these clients have a deadman switch doing a healthcheck poll on DR and switching to it when it sees that it is "up". If they've built their systems in such a way that when it detects the DR site active it uses that, then it makes sense that having both "online" would cause some havoc. I'm sure the complexity of the entire exchange is fairly significant, and having "two" copies of it running in parallel with both able to accept and execute trades would be a scenario that can cause some unintended consequences. Fundamentally, an exchange is "atomic" and transactional and cannot be meaningfully distributed to two sites that far away. The replication in place is likely master/slave with a switch to make the slave primary. Anyone who has toyed with master-master replication on less complicated databases knows the issues that can come up with split writes. Imagine that at the scale of a system as large as the NYSE.

[1] https://www.nyse.com/publicdocs/support/DisasterRecoveryFAQs...
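
Purely to illustrate the failure mode described above (this is not NYSE's or any client's actual logic, and the endpoints are made up), a naive client-side deadman switch might look like:

    import socket

    # Hypothetical endpoints; real participants get per-gateway IPs from the exchange.
    PRIMARY = ("primary-gateway.example.net", 9000)
    DR_SITE = ("dr-gateway.example.net", 9000)

    def is_reachable(addr, timeout=1.0):
        # Crude liveness check: can we complete a TCP handshake?
        try:
            with socket.create_connection(addr, timeout=timeout):
                return True
        except OSError:
            return False

    def pick_site():
        # Naive rule inferred from the FAQ's "only one site is available at a time":
        # if DR answers, assume production has failed over. Leave DR running next to
        # production and this logic happily routes live order flow to the wrong place.
        return DR_SITE if is_reachable(DR_SITE) else PRIMARY

    print("routing order flow to", pick_site())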


Remember the flash crash of 2015? They let those trades actually STAND. Including options. This week’s open was nothing in comparison.


That wasn't an exchange "malfunction" in the sense that the exchange did not do what it was supposed to, was it?


Do you really think they will take responsibility for billions lost/made that day? No, that's a big liability. Anyone trading that day knew it was a big glitch at the open. Some names, BLUE CHIPS, were down 40-50%! What shocked us all is that they actually allowed the trades to stand. What was different about this week was that the SEC actually tried to do their job for once and the exchange had to address it, i.e. come up with a bullshit excuse.


So... what was wrong? Why should those people not have been allowed to make and lose money?


While we're on the subject: if your company uses technology in any capacity, read the book The Phoenix Project.

https://www.amazon.com/Phoenix-Project-DevOps-Helping-Busine...


I wanted to like that book but it's a big Just-so story.

tl;dr: The brave knight implemented devops and everyone lived happily ever after!


Yeah. But it made clear many things that are otherwise too often cloudy. And despite its age it's still relevant.

Some people need a story.


Tech can be so fragile. You do everything right, trade millions of shares every day and handle billions of dollars. But you forget to run one script to shut down a backup system and everything comes crashing down: your reputation in tatters, millions in costs to settle bad trades, barbarians at the gates.


> That misled the exchange’s computers to treat the 9:30 a.m. opening bell as a continuation of trading, and so they skipped the day’s opening auctions that neatly set initial prices.

I didn't even know about this process. I don't know much about trading, but it surprises me that there is a separate process for setting prices at the start of trading, and that if it's missed, chaotic prices result.

Is this related to how stock markets aren't really ever open 24 hours? Do they need that reset to function in a stable way?


That's right. In short it's something like this: stocks trade on their primary exchanges during specific hours. For example 9:30 to 4 in the US.

Part of it is legacy from when trading was done by actual humans who had to be at the exchange physically during those hours, and part of it (I would guess this is still the case) is to allow plenty of non-trading hours for back-office jobs and settlement.

So yes, there's a special start-of-day process at 9:30 that runs through all the orders on the books at that time, determines a price at which some optimal set of those orders can trade, trades them at that price, and also posts that price as the open price for the day.

The process is different during continuous trading since orders are one by one matched against the order book.

Source: ran one of the world's largest equity platforms for 5 years.


isn't there another component to the NYSE auction where the DMM has some input into what the closing/opening price actually is?


Yes. In my reply to the first comment I mentioned setting the opening price is "more complicated". Every exchange has their own system for the opening auction which you buy into when you list with a particular exchange. Most exchanges have an algorithmic way of calculating the price. For NYSE, it's again more historical. A Designated Market Maker (DMM) for a stock technically determines the opening price. There is a person physically on the NYSE trading floor who represents the DMM firm who technically opens the different stocks. They have a weird custom keyboard from NYSE for this purpose...

The price is usually calculated algorithmically by the DMM firm and sent to the person at NYSE to approve. Pretty arcane. Also somewhat shady, as the DMM firm can be and is part of the auction themselves. DMM firms can analyze the order book to see what the imbalance is in the overlapping region, place an order of their own to correct the imbalance, and then set the opening price. I can see how one could profit from this in certain situations.
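
As a rough illustration of the imbalance math (a toy, not the actual NYSE/DMM workflow): given the pre-open book, you can compute how many shares would pair off at a candidate opening price and how lopsided the remainder is, which is the size an offsetting DMM order would target.

    def imbalance_at(price, buys, sells):
        # buys/sells: (limit_price, qty) tuples resting in the pre-open book.
        demand = sum(q for p, q in buys if p >= price)   # willing to buy at `price` or higher
        supply = sum(q for p, q in sells if p <= price)  # willing to sell at `price` or lower
        return min(demand, supply), demand - supply      # (paired shares, buy-side imbalance)

    buys = [(101, 500), (100, 300)]
    sells = [(99, 300), (100, 200)]
    print(imbalance_at(100, buys, sells))  # (500, 300): 500 shares pair, 300 left to buy
    # An offsetting sell of ~300 shares would let the stock open balanced at 100.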


I didn't realize the floor broker was actually involved with setting the opening price. I always wondered what the incentive was to access floor feeds for opening auctions.


FYI, there's a similar auction for closing, too. The closing price isn't just a race for the last trade under the buzzer; there's a process where at some number of minutes before close, you can put in orders for close or realtime, and then magic happens.


Worked in HFT for a few years. The reason most markets are not open 24 hours is more human, and just historical: aligned with people's 9-5 workday. There are also pre-open and post-close trading sessions, but they're much less liquid. Futures markets are open almost 24 hours; even there, they're down for some time daily. Personally I think it's actually inertia that keeps existing markets this way: the systems of the exchanges and participants were designed with the assumption that they will have daily downtime, so it's hard to change. It's also dependent on how banking and settlement works; a lot of stuff happens after trading ends. Batch processes run as different institutions settle their trades with each other, etc.

Now, as a result, there needs to be a way to set the opening price and closing price, like a bootstrap process. A smaller version of this process actually happens every time a stock gets halted and resumed.

An exchange has an order book - orders of things people want to buy and sell at different prices. During normal operation the buy and sell orders don't overlap in the order book - if two people want to buy and sell at the same overlapping price, they just get matched by the exchange at that moment. Unmatched orders stay in the order book data structure until a matching order comes along. The "price" you see in charts is just the midpoint between the highest buy and lowest sell price in the order book.

Now, if the order book is empty, what the heck is the price? That's what the opening auction needs to solve. The way it works is that people can start placing orders ahead of the opening bell, but they won't get matched until the open. So before the open, the order book is getting filled with orders, but crucially the _orders will overlap_. This "crossed" order book is a no-no during normal trading, but OK before the opening auction. When the auction comes, a price is picked which maximizes the amount of orders filled (it's more nuanced than that, but bear with me). Imagine you pick a price in the overlapping region of the order book - every buy order with a higher price than that will match with every sell order with a lower price than that. They will get matched and executed at the opening price, and BAM, you have an uncrossed order book, full of orders.
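
A toy version of that uncross, just to show the "maximize matched volume" idea (real auction algorithms also handle ties, imbalance, reference prices, and so on):

    def opening_auction(buys, sells):
        # buys/sells: (limit_price, qty) orders collected before the bell.
        candidates = sorted({p for p, _ in buys} | {p for p, _ in sells})
        best_price, best_volume = None, 0
        for p in candidates:
            demand = sum(q for price, q in buys if price >= p)   # buyers OK with p
            supply = sum(q for price, q in sells if price <= p)  # sellers OK with p
            if min(demand, supply) > best_volume:
                best_price, best_volume = p, min(demand, supply)
        return best_price, best_volume

    # A deliberately crossed pre-open book:
    buys = [(101, 500), (100, 300), (95, 200)]
    sells = [(94, 400), (99, 300), (102, 100)]
    print(opening_auction(buys, sells))  # (99, 700): open at 99, 700 shares cross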

If the auction doesn't happen, and you just open the stock, then all hell breaks loose. Many things can go wrong here. Firms connected to the exchange may have code that assumes a book is not crossed (or at least not as crossed as it would be during an auction) causing wild behavior. The exchange itself could start matching orders haphazardly in the overlapping region, causing those "price swings" that the article talked about.

Can't imagine the panic that day haha.


Very helpful and clear, thank you.

> Now, as a result, there needs to be a way to set the opening price and closing price, like a bootstrap process. A smaller version of this process actually happens every time a stock gets halted and resumed.

So this suggests that if you did have a hypothetical exchange that ran 24/7... and something unusual happened to make trading halt completely (which always is going to happen occasionally, whether 9/11 level or more frequently)... you would still need to have that "bootstrap" process in place to re-start trading.

But if you normally ran 24/7, you'd have a process that you maybe had never used, or hadn't used in years!

This maybe provides another justification, beyond the historical one, for having exchanges shut down every day: you are at least testing the bootstrap process daily, so you don't end up with a bootstrap process you're going to need in an emergency (the worst time to have further problems) that has actually just been sitting around unused for years!

(Reminding me of making sure you test your backup and continuity processes regularly, right? And the irony here is that it's the backup/continuity processes which are alleged to have caused the issue here! but still, you need the backup/continuity processes...)


Matt Levine suggested that the chaos after opening was mainly due to market orders executing at ridiculous prices. Like, a limit buy for half the "real price" is the first buy order to get in the door, and that gets matched with a market sell order.

Does that track with your understanding?


Yes! I almost forgot about market orders because trading firms never use market orders for this exact reason - you have no control over the price if things go bad. Most flash crashes are exacerbated by runaway market orders and stop orders for example.

A market buy order would try to match with the "best price", which in a deeply crossed book would mean matching with a really low-priced sell order. Exchanges match orders in price-time priority. The same is true for a market sell order: it would match at an extremely high price.

Besides the midpoint of the order book, another metric people use for the "current price" of a stock is the last trade price. In the situation above you would get "swings" in the price because market orders would be trading very high and very low as they alternate between buying and selling. The data structure on the exchange itself isn't "swinging"; it's just the overlapping region being slowly eroded by market orders. The last-trade-price metric looks really insane in this situation.
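
A toy illustration of that erosion (price priority only; time priority, halts, and everything else omitted): with a crossed book, alternating market orders print at 50 and then 140 even though nothing in the book is actually moving.

    import heapq

    # A crossed book left over from a skipped auction: sells resting far below
    # where resting buys are willing to pay.
    resting_sells = [(50.00, 100), (99.50, 200), (150.00, 100)]  # (price, qty); best = lowest
    resting_buys = [(-140.00, 100), (-100.00, 300)]              # negated so heap pops highest first
    heapq.heapify(resting_sells)
    heapq.heapify(resting_buys)

    def market_buy(qty):
        # A market buy lifts whatever the cheapest resting sell happens to be.
        prints = []
        while qty > 0 and resting_sells:
            price, avail = heapq.heappop(resting_sells)
            take = min(qty, avail)
            prints.append((price, take))
            qty -= take
            if avail > take:
                heapq.heappush(resting_sells, (price, avail - take))
        return prints

    def market_sell(qty):
        # A market sell hits whatever the highest resting buy happens to be.
        prints = []
        while qty > 0 and resting_buys:
            neg_price, avail = heapq.heappop(resting_buys)
            take = min(qty, avail)
            prints.append((-neg_price, take))
            qty -= take
            if avail > take:
                heapq.heappush(resting_buys, (neg_price, avail - take))
        return prints

    print(market_buy(100))   # [(50.0, 100)]  -- last trade price: 50
    print(market_sell(100))  # [(140.0, 100)] -- last trade price: 140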


https://www.nyse.com/publicdocs/support/DisasterRecoveryFAQs...

> Question: Can I connect to both the production and the DR site at the same time?

Answer: No, only one site is available at a time. When the primary site is up, the DR site is down; and when the DR site is activated, the primary site is down.

I think they need to update these docs to say /should/ be down


I sweat over "idiot-proofing" the smallest systems, while multi-billion dollar operations don't seem to care enough.

Like S3 being blown away by a simple change in the early days, or GitHub running a test suite with production settings. It's the FIRST thing I think about when starting a project.

https://github.blog/2010-11-15-today-s-outage/


If a single staffer SNAFU can send your exchange into chaos then you dun goofed at risk management and probably a whole lot of other management discipline.


I spent an hour trying to figure out why my new stock purchase had disappeared from my account. I had an order placed for opening on Tuesday morning, and I guess I was affected by the trade cancellations. Which is totally weird, because they showed up in my account on Tuesday morning after opening.


It's nearly always the damn humans. I find it awful that we often assign blame to some "technology error" when it's the damn humans pulling the strings all the time. All those times the market shut accidentally, or opened at the wrong time, or someone accidentally deleted all the GTC orders to "save some disk space", or someone tested opening the market at the weekend and put in the wrong date. Sometimes we are just trying to test that things work, and so we take awful risks like adding test orders, or failing over to test that backup versions of the trading infrastructure still work. All these things add human execution risk.

That said, I find the US market structure unfair, and Charles Schwab does protest too much. Retail orders never seem to get near the central order book; there is no direct market access. Brokers just sell your order to whichever MM pays them for the spread in return for a kickback. This should be a fantastic, fair multiplayer game, but instead it's pay-to-win mobile crap with vested interests milking their customers.


From the article:

> Meanwhile, market professionals and day traders are rattled and waiting for the exchange to elaborate on what it publicly called a “manual error” involving its “disaster recovery configuration”.

Oh, I love it -- a disaster caused by "disaster recovery configuration" :-)


> Oh, I love it -- a disaster caused by "disaster recovery configuration" :-)

People install failover configurations to minimise time-to-repair or time-to-resume service (and some customers' contracts will demand this). This is at the expense of another layer of stuff to go wrong, and raising the possibility that it fails over when it shouldn't, causing brief but embarrassing outages.

It's possible in some such situations that, on the balance of probabilities, introducing mechanisms like this causes more disruption over time than they were intended to protect against, and that this is more widespread than often considered. Still, their operational cost must be borne in order to satisfy the clause in the customers' contracts.



That means some process is broken. Imagine reading "flight crashed because a baggage handler placed some suitcases wrongly in the cargo space".


This and other exchanges need to be running 24/7, in large part to level the trading field for retail investors. The backend should be handled invisibly behind the scenes. There should also be an exchange version of Netflix's Chaos Monkey running constantly to ensure such critical infrastructure is robust.

The fact that these systems do not exist is an exchange problem, not a "staffer" problem.


why do retail traders "need" to exist?

Retail traders realistically have only luck to rely on to beat hedge funds and banks. What they do is akin to gambling, which is on net quite negative for those who participate in it and is heavily regulated. Retail traders don't serve any purpose in our society. They don't help with efficient allocation of capital, and anyone who might be an actual savant at trading can join or start a firm rather than staying independent and unlicensed.


It seems that this wasn't as routine as these things ought to be but rarely are.


Yes it could have also been a test of potential escalation from Solomon Islands.


Oh, yes. It is the staffer who made the mistake.

What about people who designed it this way?


Weak blame game.


Yeah, absolutely nothing to do with the SPX hitting a target 4k level and everything getting messed up just as it hit that level.


[flagged]


I am curious what your "independent research" has turned up on the subject.


You suspect a breach?


He is, undoubtedly, a meme-stock conspiracy theorist. Only those steeped in the cult of AMC, GME, or BBBY say things like that.


There is no way a person, a single person, an unauthorized person, can have access to such a system/functionality like this. Utter BS.


my 2c:

In reality what probably happened is that the previous market day's post-trading data encountered some kind of error, which triggered a cascade of problems overnight that they were unable to properly rectify. This caused delays right up until market open. They were unable to fully resolve the issue and, faced with either delaying the market open (which is a HUGE no-no) or opening with the wrong data as is, they chose the wrong data.

All in all, a lot of people didn't get much sleep Monday. More than likely they implemented some changes or updates over the weekend that were not properly done, or they encountered some errors, and didn't have adequate controls/time to roll back Monday night. They made the right calls too late, and there was a controls process up the chain that seriously fucked up. These are the kinds of problems that get the CEO woken up in the middle of the night.



