Tags: python, google-bigquery, google-python-api

How to query a particular job ID submitted to BigQuery


I have been using the google-cloud Python library to submit a BigQuery job that extracts a table to GCS, like this:

import uuid

dataset = self.bqClient.dataset(self.website_id)
table = dataset.table(table_name)
job_name = str(uuid.uuid4())  # the job ID I want to look up later
job = self.bqClient.extract_table_to_storage(
    job_name, table, destination)
job.destination_format = "NEWLINE_DELIMITED_JSON"
job.compression = 'GZIP'
job.begin()

Here the job ID (or name) is the UUID4. Basically I am collecting these job IDs in a queue and want to check later whether each job is in the DONE state. How can I do that?

I have been searching for this but so far have had no luck. I was only able to find client.list_jobs(), but that returns the list of all jobs; I only want to query or search for one particular job.


Solution

  • You can filter the list of jobs by the name attribute to find the one you want.

    Let's say you want to get information about a job whose ID is "big name string job 1". You can pick it out from the list of jobs by running:

    job_name = "big name string job 1"
    matches = [job for job in self.bqClient.list_jobs() if job.name == job_name]
    job = matches[0]  # this will raise IndexError if no job matches

    print(job.state)  # DONE or RUNNING
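
    If you would rather not risk the IndexError when nothing matches, a small variation using next() with a default works too (just a sketch against the same 0.26-era client as above):

    job_name = "big name string job 1"
    # next() with a default avoids the IndexError when no job matches
    job = next((j for j in self.bqClient.list_jobs() if j.name == job_name), None)
    if job is not None:
        print(job.state)  # DONE or RUNNING
    else:
        print("job not found")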
    

    Just make sure to update the client library if possible; at the time of writing it is at version 0.26.0.

    [EDIT]:

    You mentioned in your comment that you have around 1 million jobs. As for a get-job call, the client currently makes one only inside a job's exists() and reload() methods, so there is no way of running something like client.get_job(job_name).

    Still, in list_jobs() you can pass the all_users and state_filter arguments, something like:

    job_name = "big name string job 1"
    matches = [job for job in self.bqClient.list_jobs(all_users=False, state_filter='done')
               if job.name == job_name]
    job = matches[0]
    

    This lists only the jobs belonging to the user authorized in the client, and only those in the given state.
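
    Applied to your scenario of collecting job IDs in a queue, one way to check them in a single pass is to build a set of finished job names and test membership (a sketch; pending_job_names here stands in for whatever queue or collection you keep the UUIDs in):

    # pending_job_names: the job names (UUIDs) you collected when submitting
    done_names = {job.name for job in
                  self.bqClient.list_jobs(all_users=False, state_filter='done')}
    finished = [name for name in pending_job_names if name in done_names]
    pending = [name for name in pending_job_names if name not in done_names]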

    If that still lists millions of jobs, you could (as a somewhat "hacky" solution) build the job directly from the base job class, something like:

    from google.cloud.bigquery.job import _AsyncJob  # private base class

    job = _AsyncJob(job_name, self.bqClient)
    job.reload()  # fetches the current job resource from the API
    print(job.state)  # RUNNING or DONE
    

    This is an indirect way of getting the equivalent of a client.get_job().
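
    For example, to wait for one of your queued job names to finish without listing anything, a rough polling sketch built on this (again relying on the private _AsyncJob class, which could change between releases) might look like:

    import time

    from google.cloud.bigquery.job import _AsyncJob

    def wait_for_done(client, job_name, poll_seconds=10):
        # Build a handle from the job name and reload until it reports DONE.
        job = _AsyncJob(job_name, client)
        while True:
            job.reload()
            if job.state == 'DONE':
                return job
            time.sleep(poll_seconds)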

    This might be an interesting feature request for the Python client repository.
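
    For what it's worth, if you are able to move to a newer release of google-cloud-bigquery later on, the lookup becomes a direct call (this assumes a recent version where Client.get_job() exists and the name attribute has become job_id):

    from google.cloud import bigquery

    client = bigquery.Client()
    job = client.get_job(job_name)   # job_name is the UUID you generated
    print(job.job_id, job.state)     # e.g. DONE or RUNNING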