cgaebel's comments (Hacker News)

Thanks for working with Jepsen. Being willing to subject your product to their testing is a huge boon for Redpanda's credibility.

I have two questions:

1. How surprising were the bugs that Jepsen found?

2. Besides the obvious regression tests for bugs that Jepsen found, how did this report change Redpanda's overall approach to testing? Were there classes of tests missing?


It wasn't a big surprise for us. Redpanda is a complex distributed system with multiple components even at the core level (consensus, idempotency, transactions), so we were prepared for something to be off. (We were pleased to find that all the safety issues were in things that were behind feature flags at the time.)

Also, we have an internal chaos test suite, and by the time the partnership with Kyle started we had already identified half of the consistency issues and sent PRs with fixes. The issues made it into the report because the fixes hadn't been released yet when we started, but this is acknowledged in the report:

> The Redpanda team already had an extensive test suite— including fault injection—prior to our collaboration. Their work found several serious issues including duplicate writes (#3039), inconsistent offsets (#3003), and aborted reads/circular information flow (#3036) before Jepsen encountered them

We missed the other issues because we hadn't exercised some scenarios. As soon as Kyle found them, we were able to reproduce them with the in-house chaos tests and fix them. This dual-testing approach (Jepsen + the existing chaos harness) was very beneficial: we could check the results and give Kyle feedback on whether he'd found a real issue or something that looked more like expected behavior.

We fixed all the consistency (safety) issues, but there are several unresolved availability dips. We'll stick with Jepsen (the framework) until we're sure we've fixed those too, but after that we'll probably rely just on the in-house tests.

Clojure is a very powerful language, and I was truly amazed at how fast Kyle was able to adjust his tests to new information, but we don't have Clojure expertise and even simple tasks take time. So it's probably wiser to use what we already know, even if it's a bit more verbose.


We are similarly sad about how unavailable Intel PT is in VMs. In 2022, being unavailable on Macs and in VMs raises the barrier to entry extraordinarily high for many people in our target audience. Not sure if working outside of work is your cup of tea, but we've found Intel NUCs (under $1000) to be an unobtrusive way to play with these features at home.

Good point about overhead. I've moved the 2%-10% number front and center, and wrote up a bit more detail about where that comes from in a new wiki page: https://github.com/janestreet/magic-trace/wiki/Overhead

We'll think about adding flame graphs. We unfortunately have little experience writing responsive web UIs; the excellent Perfetto developers did all of the heavy lifting on that front. But who knows, maybe an enterprising Open Source Contributor could help us out. I see Matt Godbolt was asking questions in their Discord the other day...


(Perfetto developer here)

The Perfetto UI already supports flamegraphs, btw (we use it for memory profiling and CPU stack sampling). We've never bothered to implement it for userspace slices because we've never had high-frequency data there that would make it a worthwhile view of the data.

Contributions for this upstream are very welcome :)


You basically understand how this works: you see everything, but there might be gaps in the trace. In our experience they're rare (< 10 per multi-millisecond trace) and short-lived, and magic-trace can mostly infer what happened in that period fairly easily. You'll see these show up in the final trace as a little arrow that says "Decode error: Overflow packet" when you click on it, and the trace might look a little wonky (hopefully not too wonky!) from that point on.

In fact, if you look carefully at the demo gifs in the README, that trace had 5 decode errors! Nonetheless, it was extremely usable.

Snapshot sizes are configurable--you can go back as far as you like. However, the trace viewer tends to crash when the trace files reach the hundreds of MB and you'll need to do some work to set up a trace processor outside of your browser for the UI to connect to. The UI will offer up some docs if you actually run into this.

I'm so glad you asked us about PMU events, we've been thinking a lot about those. These are available in traces of the efficiency cores of Alder Lake CPUs, but nothing else. When we get our hands on a server class part with PMU tracing we'll add support ASAP. We conjecture that it will be absurdly useful to see cache events on a timeline next to call stacks.


Broadwell works, Skylake or later works better. We go into more detail about what platforms we support and why in https://github.com/janestreet/magic-trace/wiki/Supported-pla...


To be clear: bitcharmer says "we" to mean "fellow HFTs" not "Jane Street".


Yes. Thanks, should have made that more explicit.


It can! You can read more about how to do that yourself here: https://perf.wiki.kernel.org/index.php/Perf_tools_support_fo....

magic-trace uses perf. If you want, you can think of it as a mere "alternative frontend" for the Intel PT decoding offered by perf.
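For the curious, the raw perf workflow that magic-trace builds on looks roughly like this (a sketch based on the perf documentation linked above, assuming an Intel CPU with PT support and a reasonably recent perf; not magic-trace's own invocation):

```shell
# Raw Intel PT via perf (commands from the perf docs; need PT-capable hardware):
#
#   perf record -e intel_pt//u -- ./your_program   # capture a userspace PT trace
#   perf script --itrace=i0ns                      # decode it into samples
#
# Check whether this machine exposes the intel_pt PMU at all:
ls /sys/bus/event_source/devices/ 2>/dev/null | grep intel_pt || echo "no intel_pt PMU"
```

If the last line prints `intel_pt`, the commented commands above should work; otherwise you're on hardware (or a VM) without PT.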


Ah okay, I misunderstood magic-trace to be an alternative to perf.


Hi HN! I'm Clark, one of the maintainers of magic-trace.

magic-trace has been submitted before; our first announcement was this blog post: https://blog.janestreet.com/magic-trace/.

Since then, we've worked hard at making magic-trace more accessible to outside users. We've heard stories of people thinking this was cool before but being unable to even find a download link.

I'm posting this here because we just released "version 1.0", which is the first version that we think is sufficiently user-friendly for it to be worth your time to experiment with.

And uhh... sorry in advance if you run into hiccups despite our best efforts. Going from dozens of internal users to anyone and everyone is bound to discover new corners we haven't considered yet. Let us know if you run into trouble!


Windows has Windows Performance Analyzer, GPUView and PIX so most game devs are covered on that front :)


Do people still use GPUView? It hasn't seen a lot of development in years AFAIK, and I wondered whether it's still working and useful.

PIX is great! It gets regular updates and has an active and responsive Discord channel.


We use it internally; I'm not entirely certain about external usage. It still works and is good for tracking command packet scheduling and inter-process wait chains.

Yup, PIX is THE tool for game developers. The Direct3D team also has a very responsive Discord channel :)


I’m probably missing something, so apologies if this is obvious: does this only work on compiled programs, or could it work on any arbitrary running code? Everything from Firefox to my random Python script?


It works best on compiled programs.

We do try to support scripted languages with JITs that can emit info about which symbol is located where [1]. Notably, this more or less works for Node.js. It'll work somewhat for Python, in that you'll see the Python interpreter's frames (probably uninteresting), but you will see any FFI calls (e.g., numpy) with proper stacks.

[1]: https://github.com/torvalds/linux/blob/master/tools/perf/Doc...
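For reference, the JIT interface described in [1] is quite simple: the runtime writes a `/tmp/perf-<pid>.map` file with one `START SIZE SYMBOL` line per JIT-compiled function, with START and SIZE in hex. A hypothetical sketch (made-up addresses and symbol names, with `12345` standing in for the JIT's pid):

```shell
# Sketch of the perf map format: "START SIZE SYMBOL", hex addresses.
# Addresses and names below are invented for illustration.
cat > /tmp/perf-12345.map <<'EOF'
40e80000 30 my_jitted_function
40e80030 a8 another_jitted_function
EOF
cat /tmp/perf-12345.map
```

perf (and therefore tools built on it) picks such a file up automatically when symbolizing the process with that pid.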


I'm curious about why Standard ML (SML) was chosen for this project, given the track record Jane Street has with OCaml. Do you see an advantage to using the former in this kind of project?


It's all OCaml, GitHub is just misclassifying it as SML :)


Hint[1] in case you’re ever in this situation:

  echo '*.ml  linguist-language=OCaml' >> .gitattributes
  echo '*.mli linguist-language=OCaml' >> .gitattributes
[1] https://github.com/github/linguist/blob/master/docs/override...


Thank you for this. I've made the change, but it looks like it may be several days before GitHub gets around to refreshing the language statistics.


So one needs to upload the trace log to your website to visualize? Any way to do it locally?


Absolutely, check out https://github.com/janestreet/magic-trace#privacy-policy and https://github.com/janestreet/magic-trace/wiki/Setting-up-a-.... With a bit of extra configuration, magic-trace can host its own UI locally. You just need to build the UI from source, and point magic-trace to it (via an environment variable).


Awesome work Clark!

Any plans to support Arm in the future? Thanks!


We don't have plans to add ARM support largely because we have no in-house expertise with ARM. That said, ARM has CoreSight which sounds like it could support something like magic-trace in some form, and we'd definitely be open to community contributions for CoreSight support in magic-trace.


On the website, scrolling doesn't work in mobile safari.


Thanks for sharing this.


why do you guys use Caml?



That seems to be a presentation about language features. I'm mostly interested in the business reasons for using the language within what Jane Street does, how the language offers a competitive advantage, and why it's "good enough" for the highly competitive HFT landscape they work in.


The language features are the competitive advantage.


Because Java is not the be-all and end-all.


why use java when you have c++?


So you can use Clojure. ;-)


Hi! We've just released a new version of magic-trace. Happy to answer questions here.


It's happened! We have some internal magic that lets us run OCaml directly: ./my_bash_script_replacement.ml, and it makes replacements fairly painless.

But we don't do it often. Each language has its strengths and weaknesses, and the constraints of a small script tend not to change that much over time. The script either starts other executables and pipes some text around, or it does some computation. The choice of which to use is often (but not always) clear.


I'm not sure I agree with the "don't do it often". Having jane-script (which is our OCaml scripting system) has allowed us to greatly reduce our dependence on Bash.

That said, there are still little things we use Bash for. But our tolerance for large bash scripts has diminished greatly over the years.


You must not live in San Francisco.


> You must not live in San Francisco.

No, but I'm rather familiar with it. In any case, I don't see how that's relevant; the context of the discussion is HN guidelines, not guidelines for San Francisco.

