
How do I open a gzip file in Google Datalab?


I have a bucket that contains a file.csv.gz. It's around 210 MB, and I'd like to read it into pandas. Does anyone know how to do that?

For a non-gz file, this works:

%gcs read --object gs://[bucket-name]/[path/to/file.csv] --variable csv

# Store in a pandas dataframe
# (assumes pandas is imported as pd and StringIO has been imported)
df = pd.read_csv(StringIO(csv))

Solution

  • You can still use pandas.read_csv, but you have to specify compression='gzip', and import StringIO from pandas.compat.

    I tried the code below with a small file in my Datalab, and it worked for me.

    %gcs read --object gs://[bucket-name]/[path/to/file.csv.gz] --variable my_file
    
    import pandas as pd
    from pandas.compat import StringIO
    
    df = pd.read_csv(StringIO(my_file), compression='gzip')
    df
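
For anyone on Python 3 with a recent pandas, where pandas.compat.StringIO is no longer available, here is a minimal sketch of the same idea using io.BytesIO. It assumes the variable filled by %gcs read holds the raw compressed bytes. I have not checked that in Datalab, so treat it as an assumption. The helper name read_gzipped_csv and the sample data are made up for illustration; the gzip.compress call only stands in for the downloaded object so the snippet runs outside Datalab.

```python
import gzip
import io

import pandas as pd


def read_gzipped_csv(raw_bytes):
    """Parse gzip-compressed CSV bytes into a DataFrame (hypothetical helper)."""
    # BytesIO wraps the compressed bytes in a file-like object;
    # compression='gzip' tells pandas to decompress while reading.
    return pd.read_csv(io.BytesIO(raw_bytes), compression='gzip')


# Stand-in for the variable filled by `%gcs read` (made-up sample data).
sample_csv = "id,name\n1,alpha\n2,beta\n"
my_file = gzip.compress(sample_csv.encode("utf-8"))

df = read_gzipped_csv(my_file)
print(df)
```

In a notebook you would skip the sample-data lines and pass the variable from %gcs read straight to the helper.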