
Airflow BigQueryToCloudStorageOperator saves CSV file with quotes in each row


I am using Airflow with the BigQueryToCloudStorageOperator operator to save a copy of a BQ table as a CSV file in a GCP storage bucket. The source table has just one column, whose content is a few comma-separated fields: a1,b1,c1,d1. The output file is saved fine, but each row has double quotes at the beginning and at the end: "a1,b1,c1,d1". In the docs I don't see any parameter to control this behavior and exclude those double quotes from the output file. Is there a way to fix this? The desired output would be just the string a1,b1,c1,d1, without double quotes at the beginning and end of each row.

There is a quote_character="" parameter which I thought could help here, but when I include it in the BigQueryToCloudStorageOperator initialization in my DAG, it doesn't change anything in my case.


Solution

  • I suspect you used the field_delimiter param with , as its value (that's the operator's default if it's not specified):

    def __init__(
            self,
            *,
            source_project_dataset_table: str,
            destination_cloud_storage_uris: List[str],
            project_id: Optional[str] = None,
            compression: str = 'NONE',
            export_format: str = 'CSV',
            field_delimiter: str = ','
    .....
    
    

    Your single column also contains , characters, which is why the BigQuery API adds quote characters around the column value.
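    This is standard CSV quoting behavior, not something specific to BigQuery: any field containing the delimiter must be wrapped in quotes so a parser can tell it apart from a field boundary. A minimal sketch with Python's csv module (unrelated to the operator itself, just to illustrate the rule):

```python
import csv
import io

def write_row(value: str, delimiter: str) -> str:
    """Write a single one-column row and return the raw CSV text."""
    buf = io.StringIO()
    # QUOTE_MINIMAL (the default) only quotes fields that contain
    # the delimiter or the quote character itself.
    csv.writer(buf, delimiter=delimiter).writerow([value])
    return buf.getvalue().strip()

# Field contains the delimiter -> quoted
print(write_row("a1,b1,c1,d1", ","))  # "a1,b1,c1,d1"
# Field does not contain the delimiter -> unquoted
print(write_row("a1,b1,c1,d1", "|"))  # a1,b1,c1,d1
```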

    I tried with a separator other than ,, for example the | character, and the CSV file was generated without quote characters around the column value:

    bq extract --location=EU \
    --destination_format CSV \
    --compression NONE \
    --field_delimiter "|" \
    --print_header=true \
    "project:mazlum_test.test_list_column" \
    gs://mazlum_dev/test_column_list.csv
    

    The resulting CSV file with | as separator is:

    my_list
    a1,b1,c1,d1
    

    The resulting CSV file with , as separator is:

    my_list
    "a1,b1,c1,d1"
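    Note that if you do keep , as the delimiter, the quotes are harmless to any standard CSV consumer: a compliant parser strips them on read and returns the original field value. A small sketch, assuming the quoted output shown above:

```python
import csv
import io

# Simulated content of the exported file when , is the delimiter
exported = 'my_list\r\n"a1,b1,c1,d1"\r\n'

rows = list(csv.reader(io.StringIO(exported)))
# The quoted line comes back as a single field, quotes removed
print(rows)  # [['my_list'], ['a1,b1,c1,d1']]
```

    The quotes only become a problem if the file is consumed by something that treats each line as a raw string rather than parsing it as CSV.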
    

    With Airflow, you only have to set the field_delimiter parameter to |, for example:

    BigQueryToGCSOperator(
        task_id='task_id',
        source_project_dataset_table='table',
        destination_cloud_storage_uris=['dest_bucket'],
        compression='NONE',
        export_format='CSV',
        field_delimiter="|",
        print_header=True
    )