Search code examples
google-bigquery

BigQuery fails on parsing dates in M/D/YYYY format from CSV file


Problem

I'm attempting to create a BigQuery table from a CSV file in Google Cloud Storage.

I'm explicitly defining the schema for the load job (below) and set header rows to skip = 1.

Data

$ cat date_formatting_test.csv
id,shipped,name
0,1/10/2019,ryan
1,2/1/2019,blah
2,10/1/2013,asdf

Schema

id:INTEGER,
shipped:DATE,
name:STRING

Error

BigQuery produces the following error:

Error while reading data, error message: Could not parse '1/10/2019' as date for field shipped (position 1) starting at location 17

Questions

I understand that this date isn't in ISO format (2019-01-10), which I'm assuming will work.

However, I'm trying to define a more flexible input configuration whereby BigQuery will correctly load any date that the average American would consider valid.

  1. Is there a way to specify the expected date format(s)?
  2. Is there a separate configuration / setting to allow me to successfully load the provided CSV in with the schema defined as-is?

Solution

  • According to the listed limitations:

    When you load CSV or JSON data, values in DATE columns must use 
    the dash (-) separator and the date must be in the following 
    format: YYYY-MM-DD (year-month-day).
    

    So this leaves us with 2 options:

    Option 1: ETL

    • Place new CSV files in Google Cloud Storage
    • That in turn triggers a Google Cloud Function or Google Cloud Composer job to:
      • Edit the date column in all the CSV files
      • Save the edited files back to Google Cloud Storage
      • Load the modified CSV files into Google BigQuery

    Option 2: ELT

    • Load the CSV file as-is to BigQuery (i.e. your schema should be modified to shipped:STRING)
    • Create a BigQuery view that transforms the shipped field from a string to a recognised date format. Use SELECT id, PARSE_DATE('%m/%d/%Y', shipped) AS shipped, name
    • Use that view for your analysis

    I'm not sure, from your description, if this is a once-off job or recurring. If it's once-off, I'd go with Option 2 as it requires the least effort. Option 1 requires a bit more effort, and would only be worth it for recurring jobs.