Tags: python, pandas, dataframe, zip, watson-studio

How to read a compressed csv-file with pandas read_csv in Watson Studio?


To read a zip-compressed CSV file with pandas in my local Jupyter notebook, I execute:

import pandas as pd
pd.read_csv('csv_file.zip')

However, in Watson Studio, read_csv() throws an exception when I replace the filename with a Cloud Object Storage streaming object.

This is the first cell of my notebook in Watson Studio:

import types
from ibm_botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

client = ibm_boto3.client(service_name='s3', ibm_api_key_id='...',
    ibm_auth_endpoint="...", config=Config(signature_version='oauth'),
    endpoint_url='...')

body = client.get_object(Bucket='...', Key='csv_file.zip')['Body']
if not hasattr(body, "__iter__"):
    body.__iter__ = types.MethodType( __iter__, body )

Now, when I try:

import pandas as pd
df = pd.read_csv(body)

I get:

'utf-8' codec can't decode byte 0xbb in position 0: invalid start byte

If I specify compression='zip':

import pandas as pd
df = pd.read_csv(body, compression='zip')

the message is:

'StreamingBody' object has no attribute 'seek'

Is there a direct way to read a zipped file with read_csv() in Watson Studio without writing explicit unpacking code?

(The pd.__version__ is 0.21.0 in both environments.)


Solution

  • The following procedure works if your file is already added as a data asset of your Watson Studio project.

    1. Create a project token for your project. In your project, go to Settings, navigate to the Access tokens section, and click New token (selecting "Viewer" in the "Access role for project" dropdown menu is enough).

    2. With your notebook in "edit" mode, open the three-dot menu in the top right corner of the screen and choose the option to insert your project token. A new first cell containing your project credentials is added; run it.

    3. Now you can use code like this:

    import pandas as pd

    # `project` is defined by the token cell inserted in step 2
    file = project.get_file("my_compressed_csv.zip")
    df = pd.read_csv(file, compression='zip')
    

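    Passing the StreamingBody straight to read_csv() does not work in Watson Studio because reading a zip archive needs a seekable file object (hence the 'seek' error in the question), so you need to use the project-lib library, which returns the file as an in-memory buffer.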
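
If you would rather keep the ibm_boto3 client from the question, a possible workaround is to copy the stream into an in-memory buffer first. The following is only a minimal sketch, assuming the whole archive fits in memory. The helper `to_seekable` and the class `FakeStreamingBody` are hypothetical names introduced here so that the example runs without cloud access; the fake class only mimics an object that has read() but no seek().

```python
import io
import zipfile


def to_seekable(stream):
    """Read a non-seekable stream fully into memory and return a seekable buffer."""
    return io.BytesIO(stream.read())


class FakeStreamingBody:
    """Hypothetical stand-in for a StreamingBody: it has read() but no seek()."""

    def __init__(self, data):
        self._buffer = io.BytesIO(data)

    def read(self, amt=None):
        return self._buffer.read(amt)


# Build a small zip archive in memory so the sketch runs without cloud access.
raw = io.BytesIO()
with zipfile.ZipFile(raw, "w") as archive:
    archive.writestr("csv_file.csv", "a,b\n1,2\n")

body = FakeStreamingBody(raw.getvalue())
buffer = to_seekable(body)

# zipfile needs seek() to locate the central directory at the end of the archive.
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    text = archive.read(names[0]).decode("utf-8")

buffer.seek(0)  # rewind so the buffer can be handed to another reader
```

In the notebook, the real `body` from client.get_object() would take the place of the fake one, and `pd.read_csv(to_seekable(body), compression='zip')` should then behave like the local filename case. I have not tested this against pandas 0.21.0 in Watson Studio, so treat it as a starting point.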