The joy of bpftrace (and dtrace before it), for me, is the ease with which 'synthetic' profiling events can be constructed from multiple underlying events. This can be used to, for example, only record the latency of malloc() while at least one TCP connection has been accepted and some particular function in your binary has already run at least once with its third parameter having a particular value.
The offwake.bt example from the article is the closest to that, but it doesn't hook any userspace functions (like malloc). That's totally possible and extremely easy -- events can be mixed from wherever in the same script and, aside from needing to know that the script is effectively running on every CPU simultaneously, things just magically work.
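To make that concrete, here's a rough sketch of what such a mixed-source script could look like -- the binary path, function name, libc path and the trigger value 42 are all made-up placeholders, not anything from a real setup:

    // arm once the kernel has accepted at least one TCP connection
    kprobe:inet_csk_accept { @accepted = 1; }

    // arm once some function in your binary has run with its third
    // argument equal to 42 (uprobe args are zero-indexed, so arg2)
    uprobe:/usr/local/bin/myapp:interesting_func /arg2 == 42/ { @armed = 1; }

    // once both conditions have been seen, time every malloc()
    uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc /@accepted && @armed/ {
        @start[tid] = nsecs;
    }

    uretprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc /@start[tid]/ {
        @malloc_ns = hist(nsecs - @start[tid]);
        delete(@start[tid]);
    }

The per-CPU point above still applies: the maps are shared, so once either condition fires on any CPU the malloc timing switches on everywhere.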
The main problem with bpftrace is that it's a pain in the ass to compile from source just now. A tool as useful as this really wants to be available on every machine by default
There has been quite a bit of work in recent months to make it easier for distributions to compile (such as using the system's bcc headers and libraries), so you should start to see bpftrace in more distributions. I packaged it for openSUSE almost a year ago, but it only recently became easy enough to package that I could reasonably submit it to Tumbleweed.
It depends on your distro; last I checked an "apt-get install bpftrace" worked fine on Ubuntu 18.04. Debian has a package as well. We're tracking them in the INSTALL.md.
Companies like Netflix and Facebook have internal bpftrace packages -- it's a default install on the Netflix BaseAMI, so it's always there.
Transparently gluing boxes together over a low-bandwidth fabric died as an active research area right around the time Plan 9 was seeing its first development. By the late 80s shared-bus SMP had demonstrated its practicality and quickly became the predominant architecture. Today we don't spawn processes on remote CPUs because scheduling across multiple CPUs is entirely transparent to us; that's a competing architecture to the older approach found in Plan 9.
MOSIX is the only system from that era that I know is still around. It had a fork by the name of OpenMosix for some time, but according to Wikipedia ( https://en.wikipedia.org/wiki/MOSIX#openMosix ): "On July 15, 2007, Bar decided to end the openMosix project effective March 1, 2008, claiming that 'the increasing power and availability of low cost multi-core processors is rapidly making single-system image (SSI) clustering less of a factor in computing'."
(I admire the downvote, but please realize this is not a question of one's opinion!)
Although no downvote from me, there continued to be plenty of research and dollars in grid computing that did stuff like that on top of "distributed, shared memory". Then there was all the research in HPC clusters that tried to create a "Single System Image", running stuff across machines like it was one machine. The MOSIX quote doesn't change the fact that various researchers kept attempting this and making prototypes.
I think the keyword here is transparency -- later designs (even stuff like MPI) explicitly expose the topology of the available hardware. SMP is the closest thing we've ever got to true transparency, and then only for 80% of cases, and for those only because the compute nodes have very similar locality and e.g. memory bandwidth
Even SMP requires careful control if you want to get anything close to the actual performance of the underlying hardware, and the topology is once again very explicit
You better specify what "SMP" means. By definition, Symmetric Multi-Processing doesn't have locality concerns, but that's basically dead, and Shared Memory Programming does, indeed, typically require attention to topology and thread binding -- frequently ignored, of course. I don't know the MPI standard well, but I didn't think there was anything requiring topology to be exposed, and it won't be in the absence of something like netloc, which is still experimental.
"Distributed, shared memory" isn't what I saw of "grid computing". That mostly seemed to be driven by people who didn't understand distributed computing -- with some exceptions like Inferno -- particularly with Globus. People slapped "grid" on anything they could get away with, though. However, distributed shared memory (or the illusion of it) for compute systems does date from the 90s, at least in Global Arrays, and is going somewhat strong in various PGAS systems, including GA.
Kerrighed was an SSI system that actually was commercialized, apparently unsuccessfully. Current (I think) proprietary software systems in that sort of space are ScaleMP and Bproc (from Penguin Computing?). Dolphin had an SCI-based hardware solution for gluing together distributed systems, at least until recently. The Plan 9-ish Xcpu service was described as building on work with Bproc, but explicitly wasn't SSI.
If you can't answer this question for yourself, why do you think you'd be capable of modifying the factory default? Memory management is one of the most complex pieces of a modern OS
1) This is an animation studio. Many things are still done by hand - so they have lots of paper, paint, ink, etc. - you name it. Everything produces toxic smoke.
2) An animation studio is usually an open space, pretty damn cramped to begin with, with lots of tools, tables, source materials and other things lying around. I'd imagine it is not the kind of place you will find most suitable for an emergency evacuation.
3) People were going up. So was the smoke. It takes you just a few deep breaths to lose consciousness. It is possible they simply didn't even make it to the door.
Not necessarily - badly ventilated staircases are a very common place to die during fires in large buildings. During mandatory fire safety training I was told that it is often better to stay in your room, close the door to the hallway (treat it as a fire barrier) and go to the window where the firefighters can come get you.
The place I work at is about the same size as this office and only has one exit (and that's the main door). Many of the facilities I visit for work are similarly lacking.
Maybe the US really goes over the top, but there are plenty of places where I'd feel pretty screwed if some sort of disaster occurred.
After the smear campaign and mass roasting he received less than 2 months ago, it's not hard to understand why any charitable intentions he had may have completely dried up
There was a bunch of hubbub around a preemptive promotion of the `pipenv` package to be _the_ package for managing dependencies in python.
There was also a period when development and releases were happening in a somewhat frenzied manner and some GitHub issues were being handled / closed in an antagonistic fashion. I don't know how extensive these issues were, but I remember seeing a few examples that seemed to back up the claims.
At risk of potentially stepping into overly personal territory: As I understand it, Kenneth was unfortunately dealing with some mental health stuff during part of that time period. He posted a blog post that mentioned this.
> There was a bunch of hubbub around a preemptive promotion of the `pipenv` package to be _the_ package for managing dependencies in python.
AFAIK that never happened; it was only marketing from his side, never an official stance from the PSF (or whoever would be the authoritative source). That's just one of the many questionable things that have happened with him in the past.
I think there was language added to the front of some official python website (one of pypa's I think?) that gave that impression. It was later changed.
Yeah, it's true that the marketing around pipenv was definitely over the top.
So in response to a catastrophic failure due to testing in prod, they're going to push out a brand new regex engine with an ETA of 2 weeks. Can anyone say testing in prod?
The constant use of 'I' and 'me' (19 occurrences in total) deeply tarnishes this report, and repeatedly singling out a responsible engineer, nameless or not, is a failure in its own right. This was a collective failure, any individual identity is totally irrelevant. We're not looking for an account of your superman-like heroism, sprinting from meeting rooms or otherwise, we want to know whether anything has been learned in the 2 years since Cloudflare leaked heap all across the Internet without noticing, and the answer to that seems fantastically clear.
This report is written by me, the CTO of Cloudflare. I say "I" throughout because organizational failings are my responsibility. If I'd said "we" I imagine you'd be criticizing me for NOT taking responsibility.
If you read the report you'd see I do not blame the engineer responsible at all. Not once. I made that perfectly clear.
I wonder if you are able to talk a bit about the development of the Lua-based WAF. I imagine the potentially unbounded cost of feeding requests into PCRE must have occurred to you or others at the time - or at least, long before this outage.
I don't mean this as some sort of lame 'lol shoulda known better' dunk - stories about technical organizations' decision-making and tradeoff-handling are just more interesting than the details of how regexes typed in a control panel grow up to become Jira tickets.
It sounds like one of the primary factors was compatibility with existing (or customer-provided) mod_security rules, if I've understood 1.75x speed hyper-you right.
Wow, I'm amazed two people could read that writeup (yourself and myself) and come to two totally different conclusions.
Pushing out a brand new regex engine surely will go through the usual process. This doesn't seem like it will take a lot of time unless there are surprises. Cloudflare clearly already has the infrastructure in place to do proper integration testing for correctness, and rampup infrastructure to ensure it doesn't cause a global outage. The global nature of this outage was because the rampup infrastructure was explicitly not used, as per the protocol.
I have no idea what you read where a single engineer was singled out. At several points in this post mortem the author identifies that the regex being written by the individual involved was far from the only cause of the outage. This is a very textbook blameless post mortem doc afaict.
The narrative about the actions taken and the meetings people were in is also par for the course for a good post mortem, since these variables are real and should be addressed by remediation items if they contributed to the outage. (For example, is it sane that the entire engineering team was synchronously in a meeting? Probably not.)
It seems we're reading different blog posts. Under the "What went wrong" section there are 11 points, all with differing levels of responsibility and ownership. He did well to identify the collective nature of this failure.
I don't see why switching to a new regex implementation would be so scary. 2 weeks to test that your regexes don't break seems fine? Seems like a long time tbh.
On top of that, they're switching to more constrained regex engines. Rust's regex engine makes guarantees about its running time, something that would have directly mitigated a portion of the issue. And it isn't as if RE2/Rust regex aren't in use anywhere; Rust's regex engine is integrated into vscode, for example.
That comment was breaking the site guidelines, quite badly in fact. We moderate comments like that the same way regardless of who or what they're about.
> there are literally 10-50 of these per day in arbitrary threads
If you can find cases of this where moderators didn't respond, I'd like to see links. The likeliest explanation is simply that we didn't see it. We don't come close to seeing everything that gets posted here, so we depend on users to tell us, either by flagging (https://news.ycombinator.com/newsfaq.html) or by emailing hn@ycombinator.com.
> What is HN running on again?
I suppose I have to answer this or someone will concoct a sinister reason why I didn't. HN doesn't run on Cloudflare.
You can easily duplicate traffic into a test infrastructure that wouldn't affect the production environment, and you're acting as if RE2 et al haven't had plenty of testing too. 2 weeks with the level of traffic (test data) that Cloudflare gets seems pretty realistic.
Alastair added struct support for kprobes yesterday, based on the functionality in bcc (which bpftrace uses). That was the final missing piece, and why I'm posting about it now. See the last example here:
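Roughly, struct support means you can now cast a probe argument to a kernel struct and walk its fields from a kprobe. A trivial illustrative sketch (kfree_skb and sk_buff here are arbitrary stand-ins, not the example referenced above):

    #include <linux/skbuff.h>

    // print the length of every socket buffer the kernel frees
    kprobe:kfree_skb
    {
        $skb = (struct sk_buff *)arg0;
        printf("freed skb, len=%d\n", $skb->len);
    }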
I'd imagine that for a client, most of the value of this bank is access to what must be the most fantastically exotic network it sits at the centre of.
There is quite a lot of surface area between a program and the kernel - the C library, dynamic linker, system call interface, memory layout, droves of permission and sanity checking logic, etc. - that would need to be updated too. A 64-bit kernel is a first step.