google-cloud-platform, google-cloud-logging, google-cloud-vertex-ai

How to create a Logs Router Sink when a Vertex AI training job fails (after 3 attempts)?


I am running a Vertex AI custom training job (machine learning training using a custom container) on GCP. I would like to publish a Pub/Sub message when the job fails so I can post a message to a chat tool like Slack. The log entry (Cloud Logging) looks like this:

{
  insertId: "xxxxx"
  labels: {
    ml.googleapis.com/endpoint: ""
    ml.googleapis.com/job_state: "FAILED"
  }
  logName: "projects/xxx/logs/ml.googleapis.com%2F1113875647681265664"
  receiveTimestamp: "2021-07-09T15:05:52.702295640Z"
  resource: {
    labels: {
      job_id: "1113875647681265664"
      project_id: "xxx"
      task_name: "service"
    }
    type: "ml_job"
  }
  severity: "INFO"
  textPayload: "Job failed."
  timestamp: "2021-07-09T15:05:52.187968162Z"
}

I am creating a Logs Router Sink with the following query:

resource.type="ml_job" AND textPayload:"Job failed" AND labels."ml.googleapis.com/job_state":"FAILED"
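
For reference, this is roughly how I create the sink with the google-cloud-logging Python client (the sink and topic names below are placeholders for my real ones, and the sink's writer identity needs publish rights on the topic):

from google.cloud import logging

PROJECT_ID = "my-project"        # placeholder
SINK_NAME = "vertex-job-failed"  # placeholder
DESTINATION = f"pubsub.googleapis.com/projects/{PROJECT_ID}/topics/vertex-job-failures"

LOG_FILTER = (
    'resource.type="ml_job" '
    'AND textPayload:"Job failed" '
    'AND labels."ml.googleapis.com/job_state":"FAILED"'
)

client = logging.Client(project=PROJECT_ID)
sink = client.sink(SINK_NAME, filter_=LOG_FILTER, destination=DESTINATION)
if not sink.exists():
    sink.create()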

The issue I am facing is that Vertex AI retries the job 3 times before declaring it a failure, but the log message is identical each time. Below are 3 example jobs; only the last one, which failed 3 times, actually failed in the end. (screenshot: the three example jobs)

In the log entries, I don't have any retry count, for example. Any idea how to solve this? Creating a BigQuery table to keep track of the number of failures per resource.labels.job_id seems to be overkill if I need to do that in all my projects. Is there a way to do a group by on resource.labels.job_id and count within a Logs Router Sink?


Solution

  • The log sink is quite simple: you provide a filter, and it publishes to a Pub/Sub topic every entry that matches this filter. There is no group by, no count, nothing!

    I propose using a combination of log-based metrics and Cloud Monitoring.

    1. First, create a log-based metric on your "Job failed" log entry (a sketch follows this list).
    2. Then create an alert on this log-based metric with the following key values:
    • Set the group-by that you want, for example the job ID (I don't know which value is the most relevant for a Vertex AI job)
    • Set the alert to trigger when the threshold is equal to or above 3
    • Add a notification channel and choose a Pub/Sub notification (still in beta)
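
    For step 1, a minimal sketch with the google-cloud-logging Python client (the metric name is hypothetical; the filter is the one from the question):

    from google.cloud import logging

    PROJECT_ID = "my-project"          # hypothetical
    METRIC_NAME = "vertex_job_failed"  # hypothetical

    LOG_FILTER = (
        'resource.type="ml_job" '
        'AND textPayload:"Job failed" '
        'AND labels."ml.googleapis.com/job_state":"FAILED"'
    )

    client = logging.Client(project=PROJECT_ID)
    # Counter metric, exposed to Cloud Monitoring as
    # logging.googleapis.com/user/<METRIC_NAME>
    metric = client.metric(METRIC_NAME, filter_=LOG_FILTER,
                           description="Vertex AI 'Job failed' log entries")
    if not metric.exists():
        metric.create()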

    With this configuration, the alert will be posted to Pub/Sub only once, when 3 occurrences of the same job ID have been logged.
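
    For step 2, a rough sketch with the google-cloud-monitoring Python client, assuming the log-based metric above and an existing Pub/Sub notification channel (the channel ID and display names are hypothetical). The policy sums the counter per job_id and fires when that count goes above 2, i.e. reaches 3:

    from google.cloud import monitoring_v3

    PROJECT_ID = "my-project"                                           # hypothetical
    CHANNEL = f"projects/{PROJECT_ID}/notificationChannels/1234567890"  # hypothetical Pub/Sub channel

    condition = monitoring_v3.AlertPolicy.Condition(
        display_name="3 failures for the same job_id",
        condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
            filter=(
                'metric.type="logging.googleapis.com/user/vertex_job_failed" '
                'AND resource.type="ml_job"'
            ),
            comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
            threshold_value=2,        # strictly greater than 2 == at least 3
            duration={"seconds": 0},  # alert on the most recent aligned value
            aggregations=[
                monitoring_v3.Aggregation(
                    alignment_period={"seconds": 3600},
                    per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_SUM,
                    cross_series_reducer=monitoring_v3.Aggregation.Reducer.REDUCE_SUM,
                    group_by_fields=["resource.label.job_id"],  # one time series per job
                )
            ],
        ),
    )

    policy = monitoring_v3.AlertPolicy(
        display_name="Vertex AI job failed after 3 attempts",
        combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
        conditions=[condition],
        notification_channels=[CHANNEL],
    )

    client = monitoring_v3.AlertPolicyServiceClient()
    client.create_alert_policy(name=f"projects/{PROJECT_ID}", alert_policy=policy)

    The same policy can of course be built in the Cloud Monitoring console; the code only illustrates which knobs (group-by field, threshold, notification channel) carry the logic.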