Tags: python-3.x, google-cloud-platform, google-cloud-storage, avro, google-cloud-datalab

Writing to GCS using AVRO within AI Notebook


Summary:

1) How to write a Pandas DataFrame to GCS (Google Cloud Storage) from within a Jupyter notebook (such as an AI Notebook)

2) In the same notebook, how to load that object into a new dataset in BigQuery

Problem

I have an object that is big enough to make it unfeasible to download locally and then write to GCS -> BQ, yet not big enough to justify processing it with Apache Beam. I brought it into the notebook using BQ magic. After making some transformations, I want to send the object back to my data repositories. Therefore, I am trying to use AVRO to copy it, but I cannot figure out how to make it work. I have tried following this guide (https://github.com/ynqa/pandavro), but I have not yet figured out how the function call should be written.

I'm doing this:

import pandavro as pdx

OUTPUT_PATH = '{}/resumen2008a2019.avro'.format('gcs://xxxx')
pdx.to_avro(OUTPUT_PATH, df4)

That returns the following error: FileNotFoundError: [Errno 2] No such file or directory: 'gcs://xxxx'

Why not Parquet? It cannot convert the data correctly: ArrowInvalid: ('Could not convert with type str: tried to convert to double', 'Conversion failed for column salario with type object')
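For what it's worth, the error points at the salario column having object dtype, so Arrow cannot pick a single type for it. If it actually holds numbers stored as strings, a cast along these lines (just a sketch, assuming df4 is the DataFrame above) would probably let Parquet work, but I would still need to get the file to GCS:

import pandas as pd

# 'salario' is an object column; cast it to numeric so Arrow can infer one type.
# Values that cannot be parsed become NaN.
df4["salario"] = pd.to_numeric(df4["salario"], errors="coerce")
df4.to_parquet("resumen2008a2019.parquet")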

Why not write directly? I tried using this post as a guide (Write a Pandas DataFrame to Google Cloud Storage or BigQuery), but it is three years old and much of it no longer works that way.

Should I surrender and just write a classic ol' CSV?


Solution

  • Writing a DataFrame to BigQuery directly is very much supported and works without a hitch.

    Assuming you are using a Google Cloud AI Platform notebook (so that we don't need to set up service accounts or install BigQuery packages), you can do the following to write a DataFrame to a BQ table:

        from google.cloud import bigquery
        import pandas

        client = bigquery.Client(location="US")
        dataset_id = 'your_new_dataset'
        dataset = client.create_dataset(dataset_id)
    
        records = [
            {"title": "The Meaning of Life", "release_year": 1983},
            {"title": "Monty Python and the Holy Grail", "release_year": 1975},
            {"title": "Life of Brian", "release_year": 1979},
            {"title": "And Now for Something Completely Different", "release_year": 1971},
        ]
    
        # Optionally set explicit indices.
        # If indices are not specified, a column will be created for the default
        # indices created by pandas.
        index = ["Q24980", "Q25043", "Q24953", "Q16403"]
        df = pandas.DataFrame(records, index=pandas.Index(index, name="wikidata_id"))
    
        table_ref = dataset.table("monty_python")
        job = client.load_table_from_dataframe(df, table_ref, location="US")
    
        job.result()  # Waits for table load to complete.
        print("Loaded dataframe to {}".format(table_ref.path))
    

    If you do want to use Pandavro, then you will need to change the output path: "gs://..." is not a local path and will not be understood by tools that can only write to the local file system. You will essentially have to split this into the following steps (a sketch of both steps follows the list):

    1. Write the file to a local directory
    2. Run a load job that imports the resulting Avro file into BigQuery
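
    A minimal sketch of both steps, reusing the dataset object created above and assuming df4 is the DataFrame from the question (the table name resumen2008a2019 is just an illustration):

        import pandavro as pdx
        from google.cloud import bigquery

        # Step 1: write the Avro file to a local path (not gs://).
        LOCAL_PATH = "/tmp/resumen2008a2019.avro"
        pdx.to_avro(LOCAL_PATH, df4)

        # Step 2: load the local Avro file into BigQuery.
        client = bigquery.Client(location="US")
        table_ref = dataset.table("resumen2008a2019")

        job_config = bigquery.LoadJobConfig()
        job_config.source_format = bigquery.SourceFormat.AVRO

        with open(LOCAL_PATH, "rb") as source_file:
            job = client.load_table_from_file(
                source_file, table_ref, job_config=job_config
            )

        job.result()  # Waits for the load job to complete.
        print("Loaded Avro file to {}".format(table_ref.path))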