A Practical Guide to Watchdogs for Embedded Systems (memfault.com)
104 points by fra on Feb 19, 2020 | 29 comments



One neat thing I've seen that doesn't get called out enough is a high-priority timer with a slightly shorter period than your watchdog. When you pet the watchdog, you pet this timer too. Then in the timer ISR, write out the trap frame to brain-dead simple non-volatile memory (we had battery-backed SRAM, and then MRAM on newer boards). Then when the board reboots and checks in, you can pull down what it was doing when the watchdog triggered.
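A minimal sketch of the pre-watchdog timer described above, written so that it also runs on a host. Everything named here is hypothetical: the `trap_frame_t` layout (loosely modeled on the registers a Cortex-M stacks on exception entry), the 900 ms and 1000 ms timeouts, and the static `g_crash_record` that stands in for battery-backed SRAM or MRAM. A real port would arm a one-shot hardware timer rather than pass `now_ms` around.

```c
#include <stdint.h>

/* Hypothetical trap frame: the registers stacked on exception entry. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} trap_frame_t;

typedef struct {
    uint32_t magic;       /* marks the record as valid */
    trap_frame_t frame;
} crash_record_t;

#define CRASH_MAGIC        0xDEADD06Eu
#define HW_WDT_TIMEOUT_MS  1000u  /* assumed hardware watchdog period */
#define PRE_WDT_TIMEOUT_MS 900u   /* slightly shorter, so this fires first */

/* Stand-in for memory that survives reset (battery-backed SRAM, MRAM).
 * On real hardware this would live in a no-init section. */
static crash_record_t g_crash_record;
static uint32_t g_pre_wdt_deadline_ms;

/* Pet the hardware watchdog and the pre-watchdog timer together. */
void watchdog_pet(uint32_t now_ms)
{
    /* A real port would also reload the hardware watchdog here. */
    g_pre_wdt_deadline_ms = now_ms + PRE_WDT_TIMEOUT_MS;
}

/* Body of the high-priority timer ISR. `frame` points at the interrupted
 * context. Returns 1 if the deadline passed and the frame was saved. */
int pre_wdt_timer_isr(uint32_t now_ms, const trap_frame_t *frame)
{
    if ((int32_t)(now_ms - g_pre_wdt_deadline_ms) < 0) {
        return 0; /* still inside the deadline */
    }
    g_crash_record.frame = *frame;
    g_crash_record.magic = CRASH_MAGIC; /* written last: frame is complete */
    return 1;
}

/* Called once after reboot. Returns 1 and fills `out` if a record exists. */
int crash_record_take(trap_frame_t *out)
{
    if (g_crash_record.magic != CRASH_MAGIC) {
        return 0;
    }
    *out = g_crash_record.frame;
    g_crash_record.magic = 0; /* consume so it is reported only once */
    return 1;
}
```

After reboot, the saved `pc` and `lr` can be resolved against the ELF file to see where the firmware was stuck.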


It's mind-boggling to me that many projects I've heard about and worked on treat (hardware) watchdog issues as impossible to debug for the exact reason you mentioned: they didn't set up a software watchdog.

Doing some basic groundwork and implementing the timer/software watchdog, writing out to logs when a watchdog is about to trigger, and not disabling the watchdog during development are all basic things teams can do to more easily catch these issues before shipping firmware.

JSYK: I do believe this was mentioned in the post as well, under the section 'Adding a “Software” Watchdog', but it went a little bit further and implemented a per-task watchdog.
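One common way to build a per-task watchdog is a check-in bitmask, sketched below. This is not necessarily how the article implements it; the function names and the 32-task limit are assumptions made for illustration. Each task sets its bit when it makes progress, and a supervisor only pets the hardware watchdog when every registered task has checked in.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TASKS 32u

static uint32_t g_registered;  /* tasks that must check in */
static uint32_t g_checked_in;  /* tasks seen since the last successful poll */

/* NOTE: on a real RTOS these read-modify-write updates need a critical
 * section or atomic operations; that is omitted in this sketch. */
void task_wdt_register(unsigned task_id)
{
    assert(task_id < MAX_TASKS);
    g_registered |= (1u << task_id);
}

void task_wdt_check_in(unsigned task_id)
{
    assert(task_id < MAX_TASKS);
    g_checked_in |= (1u << task_id);
}

/* Called periodically by a supervisor. Returns a bitmask of registered
 * tasks that have not checked in. A return value of 0 means every task is
 * healthy: the round is reset and the hardware watchdog may be pet. */
uint32_t task_wdt_poll(void)
{
    uint32_t stalled = g_registered & ~g_checked_in;
    if (stalled == 0) {
        g_checked_in = 0; /* start a new round */
        /* A real port would reload the hardware watchdog here. */
    }
    return stalled;
}
```

The non-zero return value is the useful part for debugging: logging it before the reset tells you which task stopped making progress.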


There are hardware watchdog implementations that offer this feature natively. There are two configurable intervals, a "soft" and a "hard" watchdog deadline. If you miss the soft deadline, you get an NMI, and you make the interrupt handler log whatever it can. If you miss the hard deadline, the whole thing resets.

Fun story: I've only had this fail on me once. There was a hardware bug in the interrupt controller which caused the wrong vector to be invoked for NMIs.


Yes!

I implemented something similar on a particularly annoying 8/16-bit controller just a few weeks ago. Extra fun since it had no instruction to read the program counter (and no general purpose register wide enough to hold it).

I hope ARM adds that (or IC vendors like ST if possible) to the core itself, just mirror the PC to a register on reset. Should be trivial in hardware.


Hmm, maybe you could possibly instrument the functions[1] to write the function address to a register, presuming you have enough program space. Presuming you're in C and your compiler supports that.
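With GCC, the instrumentation idea above maps onto `-finstrument-functions`, which makes the compiler call `__cyg_profile_func_enter` and `__cyg_profile_func_exit` around every function. A minimal sketch follows; `g_last_function` is a hypothetical stand-in for whatever register or RAM location survives a watchdog reset on your part, and `last_function_entered` is an illustrative helper.

```c
#include <stdint.h>

/* Stand-in for a location that survives reset (hypothetical). */
static volatile uintptr_t g_last_function;

/* The hooks themselves must not be instrumented, or they would recurse. */
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *this_fn, void *call_site)
{
    (void)call_site;
    g_last_function = (uintptr_t)this_fn; /* address of the entered function */
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *this_fn, void *call_site)
{
    (void)this_fn;
    (void)call_site; /* nothing recorded on exit in this sketch */
}

/* Read back after reboot and resolve against the map file. */
__attribute__((no_instrument_function))
uintptr_t last_function_entered(void)
{
    return g_last_function;
}
```

The cost is two extra calls per function, so it is worth measuring code size and timing before leaving it enabled.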

1: https://mcuoneclipse.com/2015/04/04/poor-mans-trace-free-of-...


I was surprised at the poor state of watchdogs in PC-class Linux systems. I needed one recently and was bummed at the state of the old watchdog daemon / softdog kernel module. It works, but it is not nearly as easy to get going (on Ubuntu) as I expected. systemd also has its own watchdog and I can't figure it out.

Anyway, it turns out I really needed a full PC hardware watchdog. I ended up buying some $8 anonymous piece of Chinese hardware that's USB powered. It hits the motherboard reset switch if the motherboard hard drive activity light hasn't flashed in a while. Dumb thing, but it seems to work.


Almost all consumer motherboards have a chipset with embedded hardware watchdog that's well supported by respective drivers in FreeBSD / Linux / etc. Some vendors like Intel may lock down that functionality, but most (Asus, Gigabyte, etc) don't.


Do you have any more info on that chipset watchdog and how I'd use it in Linux? Is it this, the TCO Watchdog? http://www.madore.org/~david/linux/iTCO-wdt-test.html


Yes, iTCO_wdt is for Intel chipsets; sp5100_tco is for AMD chipsets (including FCH).

The names are a bit weird to my taste. FreeBSD ones sound a little bit better (but maybe I am just more used to them): ichwd (ICH WD) and amdsbwd (AMD SouthBridge WD).
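On Linux, once a driver such as iTCO_wdt or sp5100_tco is loaded, the watchdog is normally driven through the kernel's watchdog character device. Here is a hedged sketch using the ioctls from `<linux/watchdog.h>`; the `wdt_*` wrapper names are made up for this example, and the device path (typically `/dev/watchdog`) is left to the caller. Be careful when trying it: opening the device arms the watchdog, and the machine will reset if it is not pet in time.

```c
#include <fcntl.h>
#include <linux/watchdog.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* "Magic close": writing 'V' just before close asks the driver to disarm.
 * Drivers built with nowayout ignore this and keep the timer running. */
int wdt_disarm(int fd)
{
    ssize_t n = write(fd, "V", 1);
    int rc = close(fd);
    return (n == 1 && rc == 0) ? 0 : -1;
}

/* Open the watchdog device and request a timeout in seconds.
 * Returns a file descriptor, or -1 on failure. */
int wdt_open(const char *path, int timeout_s)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0) {
        return -1;
    }
    /* The driver may round the value; it writes back what it accepted. */
    if (ioctl(fd, WDIOC_SETTIMEOUT, &timeout_s) < 0) {
        wdt_disarm(fd);
        return -1;
    }
    return fd;
}

/* Pet the watchdog. Call this more often than the timeout. */
int wdt_pet(int fd)
{
    return ioctl(fd, WDIOC_KEEPALIVE, 0);
}
```

A daemon would call `wdt_open` once, then call `wdt_pet` from its main loop only after its own health checks pass.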


Thanks! I somehow spent several hours researching how to run watchdogs on Linux without finding this. Seems like exactly what I need. While I'm here, systemd's watchdog docs are also pretty good: http://0pointer.de/blog/projects/watchdog.html


I found the same thing. Last time I needed a reliable PC watchdog it was an ISA (really PC/104) bus card that triggered a relay (causing reset) if a timer expired and a set of conditions hadn't been met.


One scenario missed in the list of causes is a corrupted runtime image.

An advanced topic is enabling support for the watchdog in your bootloader and having a defined recovery path when the system fails to load or, worse, the application falls into a boot loop.

If you have the space, you can fall back to a recovery image or duplicate of the application. If you don’t have the space, falling into a DFU mode is a good plan.
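A rough sketch of the boot-loop protection described above, assuming a small counter that survives resets. `boot_state_t`, the limit of three attempts, and the function names are illustrative choices, and `BOOT_RECOVERY` stands for either a recovery image or DFU mode.

```c
#include <stdint.h>

#define MAX_BOOT_ATTEMPTS 3u

typedef enum { BOOT_APPLICATION, BOOT_RECOVERY } boot_target_t;

/* Must live somewhere that survives a watchdog reset, such as no-init RAM
 * or a flash/EEPROM cell (hypothetical storage, not shown here). */
typedef struct {
    uint32_t failed_boots;
} boot_state_t;

/* Run by the bootloader on every reset. Each attempt is counted as a
 * failure up front; only the application can clear the count. */
boot_target_t bootloader_select(boot_state_t *st)
{
    if (st->failed_boots >= MAX_BOOT_ATTEMPTS) {
        return BOOT_RECOVERY; /* recovery image or DFU mode */
    }
    st->failed_boots++;
    return BOOT_APPLICATION;
}

/* Called by the application once it is confident it is healthy, for
 * example after running for a while and reaching its server. */
void app_mark_boot_ok(boot_state_t *st)
{
    st->failed_boots = 0;
}
```

With the watchdog enabled in the bootloader, an application that hangs before calling `app_mark_boot_ok` is reset, counted, and eventually routed to recovery instead of looping forever.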


Watchdogs are one of the more frustrating types of issues to debug. Chris's overview of how to implement them properly and investigate resets is an amazing resource I wish I had earlier in my career.


Heh, I work in embedded and, IMO, they were one of the most fun issues to debug in the beginning of my career. Mostly, if I had such an issue, I enabled the ISR triggering and, just before the context got trashed, I looked at where the program was. Things got pretty obvious then.

Once you get the hang of it, it becomes pretty straightforward. And it's great training on what the HW looks like and how it functions!

Granted, on lower-powered systems with fewer, more primitive debugging options (looking at you, PIC), things miiight look a bit more painful. Thankfully, such systems are (much) smaller.

IMO, the _most_ painful issues to debug are memory corruption issues happening on large systems without MPU enabled.


You’re right, if it reproduces at your desk you’ve got some good tools at your disposal.

Collecting enough state to fix them from a customer report is tricky, I would say.


On the other side even the smallest PICs and Atmels have sooo much EEPROM and flash memory available. I am still amazed how much better these chips got in the last decade. You have all the I/O you could want, nearly endless memory and the chip costs less than a bubblegum.

In smaller systems for memory checks I always have a complete self-check routine at boot. Although I cannot even remember when I had the last error for internal memory. The quality seems to be quite high, even after many read/write cycles. Never implemented that for a system with virtual memory.

What I mostly miss in embedded systems isn't debugging, it is having real unit tests without having the target board and all the periphery installed in a laboratory, which is the only location I can feasibly run any tests. And those are mostly restricted to integration tests.


The watchdog is a nice feature to have a borked system reboot, lifesaver in the field if feces hits the fan.

What's less fun is if there is too little protection against electrostatic fields/EMI on the JTAG clock pin. On the small Cortex-M class devices we work with, some of them can't shut off the JTAG part of the chip, meaning that when operating, if there are enough (I think 8) logic flips on the TCK pin in _any_ amount of time, the JTAG part wakes up and sets the HALT ON BOOT flag. Next time the device reboots (due to firmware update, or watchdog, ...), it will stop and stay in JTAG debug mode. Not nice. You need to manually power cycle the thing.

We detect this by periodically checking the JTAG power domain, and if it is on, tell the server this so that we avoid rebooting it (eg automatically after firmware update). This way we've found poor hw implementations and tough EMI environments by proxy of JTAG power domain :D.


I've worked with Cortex-M since they launched and I've never seen or heard about this problem. Of course, I don't know the kind of environment your devices work in, but many of my projects run in automotive.

TCK usually has a pull-up/termination (but I've also seen pull-downs with/without caps). Do you see this issue even with the pull-up/down?


This was primarily in a lighting solution, and it wasn't the only problem with that early hw. We noticed lights not coming back online after a firmware update without a power cycle. Since it was very early hw, IIRC it didn't have any shielding, and also IIRC no pull on TCK.

We now see this primarily on 1st/early iteration hw and on prototypes, but checking this proactively has saved us from a lot of headaches of the type "why doesn't it come back online - hw, or something in the update, or what...".


A neat enough article, but surprised it didn't talk about an electronic watchdog: basically pulsing a gpio pin to trigger a recharge of a cap which holds a transistor active for a second or so, and that transistor drives eg a relay. An alternate method uses a gpio to reset a 555 timer. This will allow machinery to cut off when the embedded circuit stops looping. That is, any attached machinery would have a guaranteed NO (normally open) circuit and can only be engaged when all the watchdogs are working properly.

Some mcu pins also go into an unknown state (neither guaranteed high nor low), so resetting a cpu can have bad consequences if it's driving big machinery, if not designed correctly.

On one project I had a PC sending software watchdog pings to several independent devices, and each of those had an actual hardware watchdog (as opposed to the CPU-resetting one in the article). I used the watchdog to physically control the power to contactors: no watchdog = no power = nothing activates.

The system controlled firing of gas burners and fans etc, but the design was very safe, heaps of redundancy and was guaranteed to fail into a safe mode at any instant.


I would like to find a reliable software watchdog that kills a process when a timer expires (for preventing zombie processes from violating lease timeouts).


I mean, you probably shouldn't rely on the local system time for determining lease timeouts for correctness, if it matters.


I mean something like CLOCK_MONOTONIC.


Don't you still need to worry about the clock not going up in pace with others? Like if a system has a lease until "100" but the clock doesn't tick (or more realistically ticks slowly), then another system could think it had the lease if it observed a local system time of 101? Maybe I'm misunderstanding how you're using the leases though.


Yes you do, and normally an upper bound epsilon for clock rate skew is explicitly assumed. That's obviously fragile but there are many highly available systems that have managed to get away with it, partly by keeping leases short.
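To make the epsilon assumption concrete, here is one way to sketch it, under a model the thread does not spell out: both the lease holder's and the granter's clocks run within a fraction ε of real time. Then a local elapsed time t can correspond to as much as t/(1-ε) of real time, while a lease of duration D on the granter's clock lasts at least D/(1+ε) of real time, so the holder should stop acting on the lease once t reaches D(1-ε)/(1+ε). The function names and the parts-per-million encoding of ε are illustrative.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Milliseconds from CLOCK_MONOTONIC, which is not affected by wall-clock
 * adjustments such as someone setting the system time. */
uint64_t monotonic_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000u + (uint64_t)ts.tv_nsec / 1000000u;
}

/* How long the holder may trust a lease of `lease_ms`, given an assumed
 * upper bound on clock rate skew in parts per million (rounded down).
 * Intended for lease durations far below the uint64_t overflow range. */
uint64_t lease_safe_duration_ms(uint64_t lease_ms, uint32_t max_skew_ppm)
{
    const uint64_t million = 1000000u;
    assert(max_skew_ppm < million);
    return lease_ms * (million - max_skew_ppm) / (million + max_skew_ppm);
}

/* Measure from when the request was SENT, not when the grant arrived,
 * so that network delay only shortens the holder's view of the lease. */
int lease_still_safe(uint64_t request_sent_ms, uint64_t now_ms,
                     uint64_t lease_ms, uint32_t max_skew_ppm)
{
    return (now_ms - request_sent_ms) <
           lease_safe_duration_ms(lease_ms, max_skew_ppm);
}
```

For a 10 second lease and an assumed 1% skew bound, the holder treats the lease as expired after 9801 ms of local time. The bound itself is still an assumption, which is the fragility mentioned above.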


Why can't you do this with plain old C and the POSIX APIs for timers and process control? i.e. roll your own?
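For what it's worth, a roll-your-own version can be fairly small. The sketch below is one possible design, not a hardened tool: the supervisor forks the worker, the worker writes heartbeat bytes to a pipe, and the supervisor sends SIGKILL if `poll` sees no heartbeat within the timeout. `run_with_watchdog` and the two example workers are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

typedef void (*worker_fn)(int heartbeat_fd);

/* Runs `worker` in a child process. Returns 1 if the child was killed for
 * missing a heartbeat, 0 if it finished on its own, -1 on error. */
int run_with_watchdog(worker_fn worker, int timeout_ms)
{
    int fds[2];
    if (pipe(fds) < 0) {
        return -1;
    }
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {            /* child: run the worker, then exit */
        close(fds[0]);
        worker(fds[1]);
        _exit(0);
    }

    close(fds[1]);             /* parent keeps only the read end */
    int result = 0;
    for (;;) {
        struct pollfd pfd = { .fd = fds[0], .events = POLLIN };
        int rc = poll(&pfd, 1, timeout_ms);
        if (rc < 0 && errno == EINTR) {
            continue;          /* interrupted by a signal: wait again */
        }
        if (rc <= 0) {         /* timeout (0) or poll error (< 0) */
            kill(pid, SIGKILL);
            result = (rc == 0) ? 1 : -1;
            break;
        }
        char buf[64];
        if (read(fds[0], buf, sizeof buf) <= 0) {
            break;             /* EOF: the child closed the pipe or exited */
        }
    }
    close(fds[0]);
    waitpid(pid, NULL, 0);     /* reap so no zombie is left behind */
    return result;
}

/* Example worker that never sends a heartbeat. */
void worker_hangs(int heartbeat_fd)
{
    (void)heartbeat_fd;
    for (;;) {
        pause();
    }
}

/* Example worker that sends three heartbeats and then finishes. */
void worker_beats_then_exits(int heartbeat_fd)
{
    const struct timespec nap = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
    for (int i = 0; i < 3; i++) {
        if (write(heartbeat_fd, "h", 1) != 1) {
            return;
        }
        nanosleep(&nap, NULL);
    }
}
```

SIGKILL cannot be caught or ignored by the worker, which is the point here. What this does not cover is a supervisor that itself hangs, or a worker that closes the pipe while still running; those cases need another layer, such as a hardware watchdog.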


Never in my career have I ever heard of any other term used than "kicking" the watchdog. Are the other terms popular in America?


Why would you kick the poor thing? :(

Not in America but terms I've heard include feeding the watchdog, keeping it alive, petting, greeting, barking at it, calling it and, my favourite, shushing it.


This was an awesome read. Thanks for writing this.



