We[1] had a similar problem with clients reporting lost callbacks[2] (our term for webhooks). To solve it, we built two options.

- Get a notification email every time the callback fails. The email contains the same information the callback was supposed to deliver.

- Retries. We retry every 5 mins for up to 24 hrs, or until the callback call succeeds within that window. We created a sub-resource called `calls` (/callbacks/[id]/calls) that keeps the status of each call we made. If a call succeeds, its status changes to "SUCCESS"; if it fails, it remains "FAILED". If the receiving system is still down after 24 hrs and the call never succeeds, the developer can call GET /callbacks/[id]/calls?status=FAILED to retrieve all the failed calls. They can then process the content and do a PUT /callbacks/[id]/calls?id=ID1&id=ID2&id=ID3... with the body `{ "status": "SUCCESS" }` to mark them as "SUCCESS".
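A minimal sketch of that retry rule, in Python. This is an illustration under stated assumptions, not Whispir's actual implementation: `deliver_with_retries`, `send`, and `sleep` are hypothetical names, and whether a last attempt happens exactly at the 24 hr mark is a guess.

```python
from datetime import timedelta

RETRY_INTERVAL = timedelta(minutes=5)   # from the description above
RETRY_WINDOW = timedelta(hours=24)      # from the description above


def deliver_with_retries(send, sleep, interval=RETRY_INTERVAL, window=RETRY_WINDOW):
    """Call send() until it returns True or the retry window runs out.

    `send` is a hypothetical function that performs one callback attempt
    and returns True on success. `sleep` takes a number of seconds; it is
    injected so the loop can be exercised without really waiting.
    Returns (final_status, number_of_attempts).
    """
    elapsed = timedelta(0)
    attempts = 0
    while True:
        attempts += 1
        if send():
            return "SUCCESS", attempts
        # Stop once the next retry would fall outside the window.
        if elapsed + interval > window:
            return "FAILED", attempts
        sleep(interval.total_seconds())
        elapsed += interval
```

In real use you would pass `time.sleep` as `sleep`. The call record stays "FAILED" after the window closes, which is what makes the later GET by status possible.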

The calls are saved for up to 7 days, so that the dev has enough time to fix their server issues and get back all the lost callback calls. This solved most of our clients' issues.

* An added benefit of this went to devs who could not accept an inbound POST from us into their network because of firewall restrictions. The firewall defeated the purpose of live callbacks, but with the `status` option they could simply check for new (`FAILED`) notifications once every 2 hrs or so and mark the ones they processed as `SUCCESS`. This way, they only look for `FAILED` calls and process them when there are any. Otherwise, there is nothing to do.
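On the client side, that poll-and-acknowledge loop could look roughly like the Python sketch below. The paths, the `status` filter, and the PUT body come from the description above; everything else is an assumption for illustration: the `http(method, url, body)` transport is hypothetical, the base URL is a placeholder, and the response is assumed to be a list of objects with an `id` field (the real response shape is in the docs[2]).

```python
from urllib.parse import urlencode


def failed_calls_url(base, callback_id, status="FAILED"):
    # GET /callbacks/[id]/calls?status=...
    return f"{base}/callbacks/{callback_id}/calls?" + urlencode({"status": status})


def ack_url(base, callback_id, call_ids):
    # PUT /callbacks/[id]/calls?id=ID1&id=ID2&id=ID3...
    query = urlencode([("id", call_id) for call_id in call_ids])
    return f"{base}/callbacks/{callback_id}/calls?{query}"


def poll_once(http, base, callback_id, handle):
    """Fetch failed calls, process each one, then mark them as SUCCESS.

    `http` is a hypothetical transport: http(method, url, body) -> parsed JSON.
    `handle` is the caller's own processing function for one call.
    Returns the list of call ids that were acknowledged.
    """
    calls = http("GET", failed_calls_url(base, callback_id), None)
    processed = []
    for call in calls:
        handle(call)  # if this raises, nothing is acknowledged
        processed.append(call["id"])  # assumed field name
    if processed:
        http("PUT", ack_url(base, callback_id, processed), {"status": "SUCCESS"})
    return processed
```

Run `poll_once` from a scheduler every couple of hours. If it returns an empty list, there was nothing to do.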

[1] Whispir - https://www.whispir.com/

[2] https://whispir.github.io/api/#handling-callback-failures