I've repeatedly seen both high quality and low quality tooling act as massive force multipliers even for small teams; the question is just which direction you want to multiply. They say an army marches on its stomach—well, a tech effort marches on its tools. I interned at a trading company with, at the time, maybe 100 developers that built almost everything in-house and yet was quite a bit more successful and productive than much larger teams I've worked on since. The larger teams had incentives and structure to "focus" on what's "important" for very narrow notions of important... and so they ended up with lower-quality internal systems, a bunch of day-to-day drag on development and, both technically and organizationally, far less adaptability. (Turns out, one of the major advantages of investing in your own tools is that it builds up deep institutional knowledge about tooling that is hard-to-impossible to maintain otherwise.)
Only valuing easily-measurable work is, frankly, a modern organizational disease. It's like searching for the keys under the streetlight—and why? As a substitute for human judgement or resolving disagreements? As a way to make work more legible for executives? It leads to unforced errors and systemic problems, but we can't seem to do anything about it.
Recently I showed a development team how to use a debugging tool that automatically collects debug snapshots from production and lets you open them with one click in the IDE. It'll jump to the offending line of code and show the state of everything, not just the stack but the heap too.
My demo is to pick a random crash and diagnose the root cause while talking. The mean time to resolution is in the low single-digit minutes.
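The tool isn't named here, and the original stack was presumably not Java, but the underlying idea can be sketched: capture full process state at the moment of an unhandled exception instead of just a log line. The class and file names below are invented for illustration, and a real snapshot debugger does much more (per-frame locals, indexing, one-click IDE navigation); this is only a minimal JVM analogue.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class CrashSnapshots {
    // Install once at startup: any uncaught exception produces a heap dump
    // alongside the stack trace, so the failure can be inspected later with
    // object state intact rather than reconstructed from logs.
    public static void install() {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            error.printStackTrace();                              // the usual log line
            dumpHeap("crash-" + System.currentTimeMillis() + ".hprof");
        });
    }

    private static void dumpHeap(String path) {
        try {
            HotSpotDiagnosticMXBean mx = ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            mx.dumpHeap(path, true);                              // true = live objects only
        } catch (Exception e) {
            e.printStackTrace();                                  // never mask the original crash
        }
    }
}
```

The resulting .hprof file can then be opened in an IDE or a heap-analysis tool to inspect the exact state that produced the failure.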
I showed this to an entire team, one person at a time, solving bugs as I went. I showed the junior devs, senior devs, and their manager.
No interest. None. Just… silence.
The tools are amazing, but the typical developer's lack of motivation to learn them is even more amazing.
Playing devil's advocate here: it could just be that they already have the stack trace in logs somewhere, so how does this get you to a fix any faster? The bulk of the time to resolution is figuring out what to do anyway.
These were all bugs that were nearly impossible to solve with just a stack trace. For example, there was a "format" exception on a web page with hundreds of uses of string formatting (to generate a report). Another example was a function call complaining about a null argument. Which instance? The page had 40+ calls to the same function, each with 6 arguments.
Most of these couldn't be reproduced either. As in, you'd get a crash once a day in a page that would otherwise work successfully thousands of times.
How would you fix a problem where all you have is a stack trace from a release build? The scenario is: you can't reproduce the errors, you don't get line numbers, and you don't even get function argument values.
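To make the ambiguity concrete, here is a hypothetical Java sketch of the null-argument case; the names are invented, the original was presumably a different stack, and the "release build" constraint corresponds to compiling with debug info stripped (e.g. javac -g:none), so the trace shows method names only. The exception points into the callee, and nothing tells you which of the dozens of identical call sites, or which of the six arguments, was at fault.

```java
import java.util.Objects;

public class ReportPage {
    // Stand-in for the real helper: called 40+ times on the page, six args each.
    static void addCell(String a, String b, String c, String d, String e, String f) {
        Objects.requireNonNull(a);
        Objects.requireNonNull(b);  // the trace only says the failure is inside addCell;
        Objects.requireNonNull(c);  // with no line numbers and no argument values it
        Objects.requireNonNull(d);  // can't say which argument, or which call site below,
        Objects.requireNonNull(e);  // passed the null
        Objects.requireNonNull(f);
        // ... build the cell ...
    }

    static void render(String[] row) {
        addCell(row[0], row[1], row[2], row[3], row[4], row[5]);
        addCell(row[5], row[4], row[3], row[2], row[1], row[0]);
        // ... dozens more calls just like these ...
    }

    public static void main(String[] args) {
        render(new String[] { "a", "b", "c", "d", "e", "f" });   // works thousands of times
        render(new String[] { "a", null, "c", "d", "e", "f" });  // the rare bad input
    }
}
```

A state snapshot removes the guesswork: the argument values for the failing call are right there.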
I could solve these in minutes using this tool. Could you match that without such tooling?
I'm guessing the organizational incentives are against investigating/fixing random crashes. It might be unscheduled work, or seen as "test" or "qa" or "ops" work. Try working with management to set up a way to reward the behavior you want to encourage.