python, pandas, azure-blob-storage, azure-sdk

MemoryError while reading a large .txt file from a blob in Python


I am trying to read a large (~1.5 GB) .txt file from an Azure blob in Python, which raises a MemoryError. Is there a way to read this file more efficiently?

Below is the code that I am trying to run:

from azure.storage.blob import BlockBlobService
import pandas as pd
from io import StringIO
import time

STORAGEACCOUNTNAME= '*********'
STORAGEACCOUNTKEY= "********"

CONTAINERNAME= '******'
BLOBNAME= 'path/to/blob'

blob_service = BlockBlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)

start = time.time()
blobstring = blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME).content

df = pd.read_csv(StringIO(blobstring))
end = time.time()

print("Time taken = ",end-start)

Below are the last few lines of the error:

---> 16 blobstring = blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME)
     17 
     18 #df = pd.read_csv(StringIO(blobstring))

~/anaconda3_420/lib/python3.5/site-packages/azure/storage/blob/baseblobservice.py in get_blob_to_text(self, container_name, blob_name, encoding, snapshot, start_range, end_range, validate_content, progress_callback, max_connections, lease_id, if_modified_since, if_unmodified_since, if_match, if_none_match, timeout)
   2378                                       if_none_match,
   2379                                       timeout)
-> 2380         blob.content = blob.content.decode(encoding)
   2381         return blob
   2382 

MemoryError:

How can I read a file of ~1.5 GB from a blob container in Python? I would also like the code to run in a reasonable time.


Solution

  • Assuming there is enough memory on your machine, then according to the pandas.read_csv API reference below, you can read the CSV blob content directly into a pandas DataFrame from the blob URL with a SAS token appended.

    (screenshot of the pandas.read_csv API reference showing that a URL can be passed as the file path)

    Here is my sample code as a reference for you.

    from azure.storage.blob.baseblobservice import BaseBlobService
    from azure.storage.blob import BlobPermissions
    from datetime import datetime, timedelta
    
    import pandas as pd
    
    account_name = '<your storage account name>'
    account_key = '<your storage account key>'
    container_name = '<your container name>'
    blob_name = '<your csv blob name>'
    
    url = f"https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}"
    
    service = BaseBlobService(account_name=account_name, account_key=account_key)
    # Generate the sas token for your csv blob
    token = service.generate_blob_shared_access_signature(
        container_name,
        blob_name,
        permission=BlobPermissions.READ,
        expiry=datetime.utcnow() + timedelta(hours=1),
    )
    
    # Directly read the csv blob content into dataframe by the url with sas token
    df = pd.read_csv(f"{url}?{token}")
    print(df)
    

    I think this avoids copying the blob content in memory several times, which is what happens when you first read the whole text into a string and then wrap it in a file-like buffer.
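
    If the resulting DataFrame itself is still too large to hold in memory at once, pandas can also read the same URL in chunks. Below is a minimal sketch (not part of the original answer) that reuses the `url` and `token` variables from the snippet above; the chunk size is only an illustrative value you would tune for your data.

    import pandas as pd

    # Read the blob in chunks of rows instead of materializing the whole DataFrame.
    # `url` and `token` are the same variables built in the snippet above.
    chunk_iter = pd.read_csv(f"{url}?{token}", chunksize=1_000_000)

    total_rows = 0
    for chunk in chunk_iter:
        # Process each chunk separately (filter, aggregate, write out, ...).
        total_rows += len(chunk)

    print("Total rows read:", total_rows)
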

    Hope it helps.