git, github, github-enterprise

Notification on failed GitHub WebHooks?


My company uses GitHub Enterprise to automatically update production and test servers when certain protected branches are updated.

When someone pushes to the repository, the push event payload is delivered to several servers, each running a small web server that receives such payloads. The web server then checks the "ref" element of the payload to see whether the updated branch is the one that server tracks.

For example, when someone pushes to the development branch, this is the start of the payload that the WebHook delivers to two servers, prod01 and dev01.

{
  "ref": "refs/heads/development",
  "before": "e9f64fa5a4bec5f68faf9533050097badf1c4c1f",
  "after": "e86956f39a26e85b850b81643332def33e7f15c6",
  "created": false,
  "deleted": false,
...
}

The prod01 server checks to see if the production branch was updated. It wasn't, so nothing happens on that server. The server dev01 checks the same payload to see if the development branch was updated. It was ("ref": "refs/heads/development"), so dev01 runs the following commands.

git -C /path/to/dev01/repo reset --hard
git -C /path/to/dev01/repo clean -f
git -C /path/to/dev01/repo pull origin development
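
The "ref" check that each server performs could be sketched roughly as follows. This is a minimal illustration, not the receiver actually in use: the function names payload_branch and should_deploy are hypothetical, and the sed pattern is a naive text match standing in for a real JSON parser such as jq.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a push payload on stdin and print the branch
# named in its "ref" element (the part after refs/heads/).
# Naive text match, for illustration only; prefer a JSON parser in practice.
payload_branch() {
  sed -n 's/.*"ref"[[:space:]]*:[[:space:]]*"refs\/heads\/\([^"]*\)".*/\1/p' | head -n 1
}

# Hypothetical helper: succeed only when the payload's branch is the one
# this server tracks.
#   $1 = branch this server tracks, $2 = raw payload text
should_deploy() {
  local tracked_branch="$1" payload="$2"
  [ "$(printf '%s\n' "$payload" | payload_branch)" = "$tracked_branch" ]
}
```

On dev01, the three git commands above would then run only when `should_deploy development "$payload"` succeeds; on prod01 the same payload would fail the check against production and nothing would happen.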

When the payload is delivered correctly, GitHub Enterprise returns this.

[Image: working payload]

But sometimes the web server isn't running on prod01 or dev01, so we get this instead.

[Image: failed payload, "We couldn't deliver this payload: Service Timeout"]

When this happens, our workflow breaks down, because it assumes that a server has the same changes as the repository as soon as its branch is updated.

How can I be notified of failed payload deliveries? I'd rather not set up something to poll the web servers or poll for bad statuses, if that's possible. Barring that, any solution that checks the status of the payload (RESTfully?) is better than checking whether the web server is still running, since the payload may still fail for other reasons.

Edit: I've checked internally and it looks like we could probably set up one of our current monitoring services to check for responses on the web server's port on each server. In the image above, it's 8090, but it frequently differs.

This isn't my ideal solution, since it only really covers the case when the web server is not responding. There are a variety of other reasons why the payload delivery might fail.


Solution

    There are two options:

    Real-time Monitoring

    Configure log forwarding and monitor for failed events in hookshot_resque with error codes 422 or 504.
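
As a rough sketch of the monitoring side, a filter over the forwarded log lines might look like the one below. The function name failed_hook_lines is hypothetical, and since the layout of the forwarded lines is not shown here, the filter only assumes that a relevant line mentions hookshot_resque and contains 422 or 504 as a separate word.

```shell
#!/usr/bin/env bash
# Hypothetical filter: read forwarded log lines on stdin and print only those
# that mention hookshot_resque and contain 422 or 504 as a whole word.
# Assumption: plain-text matching is enough; adjust to the real log layout.
failed_hook_lines() {
  grep -F 'hookshot_resque' | grep -Ew '422|504'
}
```

The function exits with status 0 when at least one line matches, so it could feed whatever alerting the log collector already provides. When reading from a live stream, adding grep's --line-buffered option to both calls avoids output being held back in a buffer.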

    Cron-based Monitoring

    A user with administrative shell access to your instance can check for failed events with the command-line utility ghe-webhook-logs. For example:

    To show all failed hook deliveries in the past day:

    ghe-webhook-logs -f -a YYYYMMDD

    The next step is to automate the command and parse its output. While this introduces a delay in detecting a failed webhook, it is the most robust and reliable method available.
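
One way to automate the command is a small wrapper that a cron job calls. The sketch below is an assumption-laden starting point, not a tested integration: report_failed_hooks and the WEBHOOK_LOGS_CMD override are hypothetical names, and the wrapper assumes the command prints nothing when there are no failed deliveries, which should be verified against the real output before relying on it.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the command above, meant to be run from cron.
#   $1 = date in YYYYMMDD form, passed through to -a
# WEBHOOK_LOGS_CMD is an assumed override so the wrapper can be tried
# without access to a real instance.
# Assumption: the command prints nothing when there are no failed deliveries.
report_failed_hooks() {
  local since="$1" output
  output="$("${WEBHOOK_LOGS_CMD:-ghe-webhook-logs}" -f -a "$since")"
  if [ -n "$output" ]; then
    # Any output is treated as a list of failures and reported on stderr.
    printf 'Failed webhook deliveries (ghe-webhook-logs -f -a %s):\n%s\n' \
      "$since" "$output" >&2
    return 1
  fi
  return 0
}
```

Because cron typically mails any output a job produces to the address set in MAILTO, printing the failures is often enough to get a notification; the non-zero return status is there for callers that want to trigger something else.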