I am running a Vertex AI custom training job (machine learning training using a custom container) on GCP. I would like to publish a Pub/Sub message when the job fails so I can post a notification to a chat tool like Slack. The log entry (Cloud Logging) looks like this:
{
  insertId: "xxxxx"
  labels: {
    ml.googleapis.com/endpoint: ""
    ml.googleapis.com/job_state: "FAILED"
  }
  logName: "projects/xxx/logs/ml.googleapis.com%2F1113875647681265664"
  receiveTimestamp: "2021-07-09T15:05:52.702295640Z"
  resource: {
    labels: {
      job_id: "1113875647681265664"
      project_id: "xxx"
      task_name: "service"
    }
    type: "ml_job"
  }
  severity: "INFO"
  textPayload: "Job failed."
  timestamp: "2021-07-09T15:05:52.187968162Z"
}
I am creating a Logs Router sink with the following filter:
resource.type="ml_job" AND textPayload:"Job failed" AND labels."ml.googleapis.com/job_state":"FAILED"
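For reference, the sink can also be created programmatically, roughly like this (a sketch with the google-cloud-logging Python client; the sink name, project, and topic are placeholders):

from google.cloud import logging

client = logging.Client(project="my-project")

log_filter = (
    'resource.type="ml_job" '
    'AND textPayload:"Job failed" '
    'AND labels."ml.googleapis.com/job_state":"FAILED"'
)

# The destination is the Pub/Sub topic the sink publishes matching entries to.
sink = client.sink(
    "vertex-job-failed-sink",
    filter_=log_filter,
    destination="pubsub.googleapis.com/projects/my-project/topics/vertex-job-failures",
)
sink.create()  # the sink's writer identity still needs roles/pubsub.publisher on the topic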
The issue I am facing is that Vertex AI retries the job 3 times before declaring it a failure, and every attempt writes an identical log entry. A job that fails once or twice and then succeeds produces the same "Job failed." entries as a job whose 3 attempts all fail, and only the latter is a real failure. The log entry does not contain any attempt count I could filter on. Any idea how to solve this? Creating a BigQuery table to keep track of the number of failures per resource.labels.job_id seems like overkill if I have to do that in every project. Is there a way to do a group by on resource.labels.job_id and count within a Logs Router sink?
A log sink is quite simple: you provide a filter, and it publishes to a Pub/Sub topic every entry that matches that filter. No group by, no count, nothing!
Instead, I propose using a combination of log-based metrics and Cloud Monitoring: create a log-based counter metric from the same filter, with resource.labels.job_id extracted as a metric label, then create an alerting policy on that metric that triggers when the count for a single job_id reaches 3, with a Pub/Sub notification channel attached. Sketches of both steps follow below.
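A rough sketch of the log-based metric with the google-cloud-logging v2 client (the metric name and project are assumptions; the filter is the one used for the sink):

from google.api import label_pb2, metric_pb2
from google.cloud import logging_v2

client = logging_v2.services.metrics_service_v2.MetricsServiceV2Client()

log_metric = logging_v2.types.LogMetric(
    name="vertex_ai_job_failed",
    description="Counts 'Job failed.' log entries per Vertex AI job",
    filter=(
        'resource.type="ml_job" '
        'AND textPayload:"Job failed" '
        'AND labels."ml.googleapis.com/job_state":"FAILED"'
    ),
    # Extract the job ID into a metric label so Monitoring can count per job.
    label_extractors={"job_id": "EXTRACT(resource.labels.job_id)"},
    metric_descriptor=metric_pb2.MetricDescriptor(
        metric_kind=metric_pb2.MetricDescriptor.MetricKind.DELTA,
        value_type=metric_pb2.MetricDescriptor.ValueType.INT64,
        labels=[
            label_pb2.LabelDescriptor(
                key="job_id",
                value_type=label_pb2.LabelDescriptor.ValueType.STRING,
            )
        ],
    ),
)

client.create_log_metric(parent="projects/my-project", metric=log_metric)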
With this configuration, the alert is posted to Pub/Sub only once, when 3 occurrences of the same job ID have been logged.
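A rough sketch of the corresponding alerting policy with the google-cloud-monitoring client (the metric name above, the project, the 1-hour alignment window, and the notification channel ID are assumptions; the Pub/Sub notification channel must be created beforehand):

from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

policy = monitoring_v3.AlertPolicy(
    display_name="Vertex AI job failed 3 times",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="3 'Job failed.' entries for the same job_id",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                filter=(
                    'metric.type="logging.googleapis.com/user/vertex_ai_job_failed" '
                    'AND resource.type="ml_job"'
                ),
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=2,  # strictly more than 2, i.e. the 3rd failure
                duration=duration_pb2.Duration(seconds=0),
                aggregations=[
                    monitoring_v3.Aggregation(
                        alignment_period=duration_pb2.Duration(seconds=3600),
                        per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_SUM,
                        cross_series_reducer=monitoring_v3.Aggregation.Reducer.REDUCE_SUM,
                        # One time series per job, so the count of 3 is per job ID.
                        group_by_fields=["metric.label.job_id"],
                    )
                ],
            ),
        )
    ],
    # ID of a notification channel of type "pubsub" pointing at your topic.
    notification_channels=["projects/my-project/notificationChannels/1234567890"],
)

client = monitoring_v3.AlertPolicyServiceClient()
client.create_alert_policy(name="projects/my-project", alert_policy=policy)

The incident payload that Monitoring then publishes to the Pub/Sub topic can be consumed (for example by a Cloud Function) to post the Slack message.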