
Google Datalab: how to import pickle


Is it possible in Google Datalab to read pickle/joblib models from Google Cloud Storage using the %%storage magic?

This question relates to "Is text the only content type for %%storage magic function in datalab".


Solution

  • Run the following code in an otherwise empty cell:

    %%storage read --object <path-to-gcs-bucket>/my_pickle_file.pkl --variable test_pickle_var
    

    Then run the following code:

    import pickle
    from io import BytesIO

    pickle.load(BytesIO(test_pickle_var))
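
To see why wrapping the variable in BytesIO works, here is a minimal sketch that runs without any GCS access. It assumes that test_pickle_var holds the raw bytes of the stored object; the sample dictionary is a hypothetical stand-in for a real model.

```python
import pickle
from io import BytesIO

# Hypothetical stand-in for the bytes that %%storage read would place
# in test_pickle_var (assumption: the variable holds the raw object bytes).
original = {"model": "example", "weights": [0.1, 0.2, 0.3]}
test_pickle_var = pickle.dumps(original)

# Option 1: wrap the bytes in a file-like object, as in the answer above.
restored_via_load = pickle.load(BytesIO(test_pickle_var))

# Option 2: deserialize directly from the bytes, skipping the BytesIO wrapper.
restored_via_loads = pickle.loads(test_pickle_var)
```

Both calls should return an object equal to the original, so pickle.loads is a slightly shorter alternative when the data is already in memory.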
    

    I used the code below to upload a pandas DataFrame to Google Cloud Storage as a pickled file and read it back:

    from datalab.context import Context
    import datalab.storage as storage
    import pandas as pd
    from io import BytesIO
    import pickle
    
    # Use lists for the rows; sets are unordered and do not map reliably to columns
    df = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])
    
    # Create a local pickle file
    df.to_pickle('my_pickle_file.pkl')
    
    # Create a bucket in GCS
    sample_bucket_name = Context.default().project_id + '-datalab-example'
    sample_bucket_path = 'gs://' + sample_bucket_name
    sample_bucket = storage.Bucket(sample_bucket_name)
    if not sample_bucket.exists():
        sample_bucket.create()
    
    # Write pickle to GCS
    sample_item = sample_bucket.item('my_pickle_file.pkl')
    with open('my_pickle_file.pkl', 'rb') as f:
        sample_item.write_to(bytearray(f.read()), 'application/octet-stream')
    
    # Read Method 1 - Read pickle from GCS using %storage read (note single % for line magic)
    path_to_pickle_in_gcs = sample_bucket_path + '/my_pickle_file.pkl'
    %storage read --object $path_to_pickle_in_gcs --variable remote_pickle_1
    df_method1 = pickle.load(BytesIO(remote_pickle_1))
    print(df_method1)
    
    # Read Alternate Method 2 - Read pickle from GCS using storage.Bucket.item().read_from()
    remote_pickle_2 = sample_bucket.item('my_pickle_file.pkl').read_from()
    df_method2 = pickle.load(BytesIO(remote_pickle_2))
    print(df_method2)
    

    Note: There is a known issue where the %storage line magic does not work if it is on the first line of a cell. Put a comment or other Python code on the first line.
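
As a possible variation on the upload step above, the object can be pickled straight to bytes in memory instead of going through a local file. This is only a sketch: to_pickle_bytes is a hypothetical helper name, and passing its result to write_to in place of bytearray(f.read()) is an untested assumption.

```python
import pickle


def to_pickle_bytes(obj):
    """Serialize obj to pickle bytes in memory (hypothetical helper)."""
    return pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)


# Hypothetical sample data standing in for the DataFrame used above.
sample_rows = [[1, 2, 3], [4, 5, 6]]
payload = to_pickle_bytes(sample_rows)

# Assumption: payload could then be uploaded with something like
#   sample_item.write_to(bytearray(payload), 'application/octet-stream')
# Round-trip check that the payload deserializes to the same data.
round_tripped = pickle.loads(payload)
```

This avoids leaving my_pickle_file.pkl on the Datalab VM's disk, at the cost of holding the whole serialized object in memory.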