google-cloud-platform, google-cloud-logging, google-cloud-vertex-ai

How to create a Logs Router Sink when a Vertex AI training job fails (after 3 attempts)?


I am running a Vertex AI custom training job (machine learning training using a custom container) on GCP. I would like to publish a Pub/Sub message when the job fails so I can post a message to a chat tool like Slack. The log entry (Cloud Logging) looks like this:

{
  insertId: "xxxxx"
  labels: {
    ml.googleapis.com/endpoint: ""
    ml.googleapis.com/job_state: "FAILED"
  }
  logName: "projects/xxx/logs/ml.googleapis.com%2F1113875647681265664"
  receiveTimestamp: "2021-07-09T15:05:52.702295640Z"
  resource: {
    labels: {
      job_id: "1113875647681265664"
      project_id: "xxx"
      task_name: "service"
    }
    type: "ml_job"
  }
  severity: "INFO"
  textPayload: "Job failed."
  timestamp: "2021-07-09T15:05:52.187968162Z"
}

I am creating a Logs Router Sink with the following query:

resource.type="ml_job" AND textPayload:"Job failed" AND labels."ml.googleapis.com/job_state":"FAILED"
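
For reference, this is roughly how I create the sink with the google-cloud-logging Python client (the sink and topic names below are placeholders for my real ones, and the sink's writer identity needs publish rights on the topic):

from google.cloud import logging

PROJECT_ID = "my-project"        # placeholder
SINK_NAME = "vertex-job-failed"  # placeholder
DESTINATION = f"pubsub.googleapis.com/projects/{PROJECT_ID}/topics/vertex-job-failures"

LOG_FILTER = (
    'resource.type="ml_job" '
    'AND textPayload:"Job failed" '
    'AND labels."ml.googleapis.com/job_state":"FAILED"'
)

client = logging.Client(project=PROJECT_ID)
sink = client.sink(SINK_NAME, filter_=LOG_FILTER, destination=DESTINATION)
if not sink.exists():
    sink.create()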

The issue I am facing is that Vertex AI retries the job 3 times before declaring it a failure, but the log message is identical each time. Below are 3 example jobs; only the last one, which failed 3 times, actually failed in the end. (screenshot: the three example jobs)

In the log entries, I don't have any retry count, for example. Any idea how to solve this? Creating a BigQuery table to keep track of the number of failures per resource.labels.job_id seems to be overkill if I need to do that in all my projects. Is there a way to do a group by on resource.labels.job_id and count within a Logs Router Sink?


Solution

  • The log sink is quite simple: you provide a filter, and it publishes to a Pub/Sub topic every entry that matches this filter. There is no group by, no count, nothing!

    I propose using a combination of log-based metrics and Cloud Monitoring.

    1. First, create a log-based metric on your "Job failed" log entry (a sketch follows this list).
    2. Then create an alert on this log-based metric with the following key values:
    • Set the group-by that you want, for example the job ID (I don't know which value is the most relevant for a Vertex AI job)
    • Set the alert to trigger when the threshold is equal to or above 3
    • Add a notification channel and choose a Pub/Sub notification (still in beta)
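
    For step 1, a minimal sketch with the google-cloud-logging Python client (the metric name is hypothetical; the filter is the one from the question):

    from google.cloud import logging

    PROJECT_ID = "my-project"          # hypothetical
    METRIC_NAME = "vertex_job_failed"  # hypothetical

    LOG_FILTER = (
        'resource.type="ml_job" '
        'AND textPayload:"Job failed" '
        'AND labels."ml.googleapis.com/job_state":"FAILED"'
    )

    client = logging.Client(project=PROJECT_ID)
    # Counter metric, exposed to Cloud Monitoring as
    # logging.googleapis.com/user/<METRIC_NAME>
    metric = client.metric(METRIC_NAME, filter_=LOG_FILTER,
                           description="Vertex AI 'Job failed' log entries")
    if not metric.exists():
        metric.create()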

    With this configuration, the alert will be posted to Pub/Sub only once, when 3 occurrences of the same job ID have been logged.
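
    For step 2, a rough sketch with the google-cloud-monitoring Python client, assuming the log-based metric above and an existing Pub/Sub notification channel (the channel ID and display names are hypothetical). The policy sums the counter per job_id and fires when that count goes above 2, i.e. reaches 3:

    from google.cloud import monitoring_v3

    PROJECT_ID = "my-project"                                           # hypothetical
    CHANNEL = f"projects/{PROJECT_ID}/notificationChannels/1234567890"  # hypothetical Pub/Sub channel

    condition = monitoring_v3.AlertPolicy.Condition(
        display_name="3 failures for the same job_id",
        condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
            filter=(
                'metric.type="logging.googleapis.com/user/vertex_job_failed" '
                'AND resource.type="ml_job"'
            ),
            comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
            threshold_value=2,        # strictly greater than 2 == at least 3
            duration={"seconds": 0},  # alert on the most recent aligned value
            aggregations=[
                monitoring_v3.Aggregation(
                    alignment_period={"seconds": 3600},
                    per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_SUM,
                    cross_series_reducer=monitoring_v3.Aggregation.Reducer.REDUCE_SUM,
                    group_by_fields=["resource.label.job_id"],  # one time series per job
                )
            ],
        ),
    )

    policy = monitoring_v3.AlertPolicy(
        display_name="Vertex AI job failed after 3 attempts",
        combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
        conditions=[condition],
        notification_channels=[CHANNEL],
    )

    client = monitoring_v3.AlertPolicyServiceClient()
    client.create_alert_policy(name=f"projects/{PROJECT_ID}", alert_policy=policy)

    The same policy can of course be built in the Cloud Monitoring console; the code only illustrates which knobs (group-by field, threshold, notification channel) carry the logic.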