
How to pass multiple delimiters in Python for BigQuery storage using a Cloud Function


I am trying to load multiple CSV files into a BigQuery table. For some CSV files the delimiter is a comma, and for others it is a semicolon. Is there any way to pass multiple delimiters in the job config?

    job_config = bigquery.LoadJobConfig(
        autodetect=True,
        source_format=bigquery.SourceFormat.CSV,
        field_delimiter=",",
        write_disposition="WRITE_APPEND",
        skip_leading_rows=1,
    )

Thanks, Ritz


Solution

  • I deployed the following code in Cloud Functions for this purpose, using “Cloud Storage” as the trigger and “Finalize/Create” as the event type. The code successfully runs BigQuery load jobs on both comma- and semicolon-delimited files.

    main.py

    def hello_gcs(event, context):
        from google.cloud import bigquery
        from google.cloud import storage
        import subprocess

        # Construct BigQuery and Cloud Storage client objects.
        bq_client = bigquery.Client()
        storage_client = storage.Client()

        # The Finalize/Create event carries the name of the object
        # that triggered the function.
        bucket = storage_client.get_bucket('Bucket-Name')
        blob = bucket.get_blob(event['name'])

        # TODO(developer): Set table_id to the ID of the table to create.
        table_id = "ProjectID.DatasetName.TableName"

        # /tmp is the only writable location in Cloud Functions.
        with open("/tmp/z", "wb") as file_obj:
            blob.download_to_file(file_obj)

        # Replace every semicolon with a comma. The /g flag makes the
        # substitution global, not just the first match on each line.
        subprocess.call(["sed", "-i", "-e", "s/;/,/g", "/tmp/z"])

        job_config = bigquery.LoadJobConfig(
            autodetect=True,
            skip_leading_rows=1,
            field_delimiter=",",
            write_disposition="WRITE_APPEND",
            # The source format defaults to CSV, so the line below is optional.
            source_format=bigquery.SourceFormat.CSV,
        )

        with open("/tmp/z", "rb") as source_file:
            # Make an API request.
            job = bq_client.load_table_from_file(
                source_file, table_id, job_config=job_config
            )

        job.result()  # Waits for the job to complete.
    

    requirements.txt

    # Function dependencies, for example:
    # package>=version
    google-cloud
    google-cloud-bigquery
    google-cloud-storage
    

    Here, I am substituting “;” with “,” using the sed command. One point to note: when writing to a file in Cloud Functions, the path must be given as /tmp/file_name, as /tmp is the only location in Cloud Functions where writing to a file is allowed. It is also assumed that the files contain no commas or semicolons other than the delimiter.
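If that last assumption does not hold, the sed substitution will corrupt rows whose fields legitimately contain semicolons. As an alternative (not part of the original answer), Python's standard-library `csv.Sniffer` can guess each file's delimiter and rewrite it as properly quoted comma-separated output; the function names below are illustrative:

```python
import csv
import io


def detect_delimiter(sample: str) -> str:
    """Guess the delimiter of a CSV sample, falling back to a comma."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=",;").delimiter
    except csv.Error:
        return ","


def normalize_to_commas(text: str) -> str:
    """Rewrite a delimited file as comma-separated, quoting fields as needed."""
    # Sniff the delimiter from the first few KB of the file.
    delimiter = detect_delimiter(text[:4096])
    out = io.StringIO()
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    csv.writer(out).writerows(reader)
    return out.getvalue()
```

The downloaded `/tmp` file could be passed through `normalize_to_commas` before the load job instead of calling sed, which keeps everything in one process and handles quoted fields correctly.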