Search code examples
scalagoogle-cloud-platformgoogle-bigquerycats-effect

Big Query Job state done before the query finishes


I have a scala application that creates a tsv table using BigQuery. When a user tries to access the data I want to return it if the query job is finished, otherwise tell them it's still running.

My query job creation looks like this:

        bigQuery.create(
          JobInfo.of(
            QueryJobConfiguration
              .newBuilder(mySql)
              .setAllowLargeResults(true)
              .setDestinationTable(TableId.of("MyReports", s"${tableName}_$random".replace("-", "_")))
              .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
              .setCreateDisposition(JobInfo.CreateDisposition.CREATE_IF_NEEDED)
              .setUseLegacySql(false)
              .build()
          )
        )

and the method to get the data looks like this:

override def getData(jobId: String): IO[Either[Throwable, String]] = {
  bigQueryService.getMyJob(jobId).map {
    case Right(None) | Right(Some(null)) => Right("Data not found, check provided job name")
    case Right(Some(r)) =>
      if (r.isDone) {
        Try(r.getQueryResults()
        .iterateAll()
        .asScala
        .map(_.asScala.map(_.getValue.toString).mkString("\t"))
        .mkString("\n")
        ).toEither
      } else {
        Right(s"Job not completed, current status is ${r.getStatus.getState.toString}")
      }
    case Left(err: Throwable) => Left(err)
  }
}

Where I'm using Cats Effect IO to evaluate at the end of computation. My problem is that the getQueryResults method on a job stalls until the query is finished. I try to prevent this by checking another method on a BQ Job: isDone. For some reason in my testing, isDone returns true before the query has completed. I see this myself when checking the BigQuery console. This results in a user's request always stalling until the query completes, instead of returning a message as intended.

How is it a job can be completed while the query still runs? Am I missing some distinction between jobs and queries? Or is there something else I've missed? Thank you for any suggestions you may have.


Solution

  • jobs.getQueryResults has an optional timeoutMs parameter which governs its hanging GET semantics. It waits for job completion up to that specified interval or 10 seconds if not specified. If the job is complete it returns immediately regardless.

    If you set the timeout to zero it will return the response immediate, and you can check the result stats to see if the job is complete. If so, the response should also contain the schema and first page of data rows.

    Additional information from the REST reference: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/getQueryResults