More

mark_undoio · 2025-07-20T10:35:56 1753007756

It's quite weird having local footpaths and paved roads that turn out to have been constructed by the Romans originally - around here that also applies to canals, drainage ditches, etc. It just blends into modern reality.

I imagine some of the Roman stuff was built on even older roads and channels.

mr_toad · 2025-07-20T12:11:21 1753013481

> I imagine some of the Roman stuff was built on even older roads and channels.

Watling street was paved by Romans, but at least part of the route was a road used by local Britons long before the Romans invaded.

mark_undoio · 2025-07-07T14:27:15 1751898435

> I don't understand this. Developer time is so much more expensive than machine time. Do companies not just double their CI workers after hearing people complain? It's just a throw-more-resources problem.

I'd personally agree. But this sounds like the kind of thing that, at many companies, could be a real challenge.

Ultimately, you can measure dollars spent on CI workers. It's much harder and less direct to quantify the cost of not having them (until, for instance, people start taking shortcuts with testing and a regression escapes to production).

That kind of asymmetry tends, unless somebody has a strong overriding vision of where the value really comes from, to result in penny pinching on the wrong things.

mike_hearn · 2025-07-07T14:50:19 1751899819

It's more than that. You can measure salaries too, measurement isn't the issue.

The problem is that if you let people spend the companies money without any checks or balances they'll just blow through unlimited amounts of it. That's why companies always have lots of procedures and policies around expense reporting. There's no upper limit to how much money developers will spend on cloud hardware given the chance, as the example above of casually running a test 10,000 times in parallel demonstrates nicely.

CI doesn't require you to fill out an expense report every time you run a PR thank goodness, but there still has to be a way to limit financial liability. Usually companies do start out by doubling cluster sizes a few times, but each time it buys a few months and then the complaints return. After a few rounds of this managers realize that demand is unlimited and start pushing back on always increasing the budget. Devs get annoyed and spend an afternoon on optimizations, suddenly times are good again.

The meme on HN is that developer time is always more expensive than machine time, but I've been on both sides of this and seen how the budgets work out. It's often not true, especially if you use clouds like Azure which are overloaded and expensive, or have plenty of junior devs, and/or teams outside the US where salaries are lower. There's often a lot of low hanging fruit in test times so it can make sense to optimize, even so, huge waste is still the order of the day.

mark_undoio · 2025-07-07T09:51:47 1751881907

Process recording by time travel debug seems like a good fit for this problem - then you can capture 100% of process execution and then go back and investigate further.

We (Undo.io) came up with a technique for following a tree of processes and initiating process recording based on a glob of program name. It's the `--record-on` flag in https://docs.undo.io/UsingTheLiveRecorderTool.html. You can grab a free trial from our website.

For open source, with rr (https://rr-project.org/) I think you'd just `rr record` the initial process and you'll end up capturing the whole process tree - then you can look at the one you're interested in.

As others have said you could also do some smart things with GDB's follow-fork settings but I think process recording is ideal for capturing complicated situations like this as you can go and review what happened later on.

mark_undoio · 2025-07-04T17:43:07 1751650987

I'm fascinated by this paper because it feels like it could be a good analogue for "can LLMs handle a stateful, text-based tool". A debugger is my particular interest but there's no reason why it couldn't be something else.

To use a debugger, you need:

* Some memory of where you've already explored in the code (vs rooms in a dungeon)

* Some wider idea of your current goal / destination (vs a current quest or a treasure)

* A plan for how to get there - but the flexibility to adapt (vs expected path and potential monsters / dead ends)

* A way for managing information you've learned / state you've viewed (vs inventory)

Given text adventures are quite well-documented and there are many of them out there, I'd also like to take time out to experiment (at some point!) with whether presenting a command-line tool as a text adventure might be a useful "API".

e.g. an MCP server that exposes a tool but also provides a mapping of the tools concepts into dungeon adventure concepts (and back). If nothing else, the LLM's reasoning should be pretty entertaining. Maybe playing "make believe" will even make it better at some things - that would be very cool.

alwa · 2025-07-04T19:03:20 1751655800

That’s a delightful concept to think about! I’m not sure what conceptual information the translation layer would add to the LLM’s internal representation of the state space.

But the broader concept of asking it to translate something structurally to a different domain, then seeing how the norms of that domain cause it to manipulate the state differently… that tickles my fancy for sure. Like you said, it sounds cool even in an art-project sense just to read what it says!

vladimirralev · 2025-07-04T20:18:34 1751660314

I've seen both replit and cline agents iteratively debug hard problem with massive amount of log lines. They can do it already.

mark_undoio · 2025-07-04T22:31:19 1751668279

That's the thing though - they're using logs. My theory is that LLMs are intrinsically quite good at that because they're good at sifting text.

Getting then to drive something like a debugger interface seems harder from my experience (although the ChatDBG people showed some success - my experiments did too, but it took the tweaks I described).

My experiments are with Claude Opus 4, in Claude Code, primarily.

throwaway81523 · 2025-07-04T21:39:52 1751665192

Look also at Delta Debugging which didn't need an LLM.

mark_undoio · 2025-07-04T17:32:45 1751650365

> It will be interesting to know what challenges came up in nudging the model to work better with time travel debug data, since this data is novel and the models today might not be well trained for making use of it.

This is actually quite interesting - it's something I'm planning to make a future post about.

But basically the LLM seems to be fairly good at using this interface effectively so long as we tuned what tools we provide quite carefully:

* Where we would want the LLM to use a tool sparingly it was better not to provide it at all. When you have time travel debugging it's usually better to work backwards since that tells you the causality of the bug. If we gave Claude the ability to step forward it tended to use it for everything, even when appropriate.

* LLMs weren't great at managing state they've set up. Allowing the LLM to set breakpoints just confused it later when it forget they were there.

* Open ended commands were a bad fit. For example, a time travel debugger can usually jump around in time according to an internal timebase. If the LLM was given access to that, unconstrained, it tended to just waste lots of effort guessing timebases and looking to see what was there.

* Sometimes the LLM just wants to hold something the wrong way and you have to let it. It was almost impossible to get the AI to understand that it could step back into a function on the previous line. It would always try going to the line, then stepping back, resulting in an overshoot. We had to just adapt the tool so that it could use it the way it thought it should work.

The overall result is actually quite satisfactory but it was a bit of a journey to understand how to give the LLM enough flexibility to generate insights without letting it get itself into trouble.

mark_undoio · 2025-07-04T16:51:00 1751647860

I am appalled and delighted by this.

It feels like an AI cousin to the Python error steamroller (https://github.com/ajalt/fuckitpy).

Whenever I see this sort of thing I think that there might be a non-evil application for it. But then I think ... where's the fun in that?

femto113 · 2025-07-04T19:54:23 1751658863

I share your feelings. What it most brings to mind for me is the infamous StackSort from the image alt text on XKCD comic 1185 (https://xkcd.com/1185/)

mark_undoio · 2025-06-22T22:27:35 1750631255

Always fun to see what izabera has come up with - every single time I'm somewhere between delighted and terrified to see what she's made the computer do this time!

mark_undoio · 2025-06-22T12:48:34 1750596514

I assume you're aware but there are some open source assistants that can get you this (at least to some extent).

E.g. Codename Goose https://block.github.io/goose/docs/quickstart/

Which I think supports Gemini among with all the other major AI providers, plus MCP. I have heard anecdotally that it doesn't work as well as Claude Code, so maybe there are additional smarts required to make a top notch agent.

I've also heard that Claude is just the best LLM at tool use at the moment, so YMMV with other models.

mark_undoio · 2025-06-22T12:39:24 1750595964

Thanks for this - I've been using the MCP Inspector https://modelcontextprotocol.io/docs/tools/inspector but find it doesn't really fit my workflow.

I like the fact this mcp-debug tool can present a REPL and act as a mcp server itself.

We've been developing our MCP servers by first testing the principle with the "meat robot" approach - we tell the LLM (sometimes just through the stock web interface, no coding agent) what we're able to provide and just give it what it asks for - when we find a "tool" that works well we automate it.

This feels like it's an easier way of trying that process - we're finding it's very important to build an MCP interface that works with what LLMs "want" to do. Without impedance matching it can be difficult to get the overall outcome you want (I suspect this is worse if there's not much training data out there that resembles your problem).

mark_undoio · 2025-05-24T23:53:34 1748130814

What's limiting us is that Undo does need a Linux kernel - so traditional embedded programming wouldn't be a fit. Embedded Linux could work and we do support ARM64.

I've thought I bit about how you might support time travel on bare metal embedded - but actually there are hardware-assisted solutions (Lauterbach's Trace32 was one we came across) there sometimes.