Tags: python, azure, azure-machine-learning-service

Azure Machine Learning - Memory Error while creating dataframe


I am getting a memory error while creating a simple dataframe from a CSV file on Azure Machine Learning, using a notebook VM as the compute instance. The VM is a DS13 with 56 GB RAM, 8 vCPUs, and 112 GB storage, running Ubuntu 16.04 (Linux). The CSV file is 5 GB.

from io import StringIO
import pandas as pd
from azure.storage.blob import BlockBlobService
blob_service = BlockBlobService(account_name, account_key)
blobstring = blob_service.get_blob_to_text(container, filepath).content
dffinaldata = pd.read_csv(StringIO(blobstring), sep=',')

What am I doing wrong here?


Solution

  • You need to provide the right encoding when calling get_blob_to_text; please refer to the sample.

    The code below is what I normally use for reading data files in blob storage. Basically, you build the blob's URL, append a SAS token, and fetch it with an HTTP GET request. However, you might want to edit the for loop depending on what types of data you have (e.g. CSV, JPG, etc.).

    -- Python code below --

    import requests
    from azure.storage.blob import BlockBlobService, BlobPermissions
    from azure.storage.blob.baseblobservice import BaseBlobService
    from datetime import datetime, timedelta

    account_name = '<account_name>'
    account_key = '<account_key>'
    container_name = '<container_name>'

    # Legacy azure-storage-blob (v2.x) clients
    blob_service = BlockBlobService(account_name, account_key)
    service = BaseBlobService(account_name=account_name, account_key=account_key)
    generator = blob_service.list_blobs(container_name)

    for blob in generator:
        # URL of this specific blob (the blob name must be part of the path)
        url = f"https://{account_name}.blob.core.windows.net/{container_name}/{blob.name}"
        # Read-only SAS token for this blob, valid for one hour
        token = service.generate_blob_shared_access_signature(
            container_name,
            blob.name,
            permission=BlobPermissions.READ,
            expiry=datetime.utcnow() + timedelta(hours=1),
        )
        url_with_sas = f"{url}?{token}"
        # Download the blob; handle response.content according to the file type
        response = requests.get(url_with_sas)
    

    To read data from Azure Blob Storage, please follow the link below: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-access-data
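
Since the original error is about memory, it may also help to avoid holding the whole 5 GB file in memory at once. get_blob_to_text returns the entire blob as a single Python string, and StringIO(blobstring) plus the resulting DataFrame can add further copies on top of that. Below is a minimal sketch, not a definitive fix. It assumes you first stream the download to local disk and then let pandas read the file in pieces via the chunksize argument of read_csv. The helper names write_chunks_to_file and count_rows_in_chunks, and the chunk sizes, are hypothetical and only for illustration.

```python
import pandas as pd


def write_chunks_to_file(chunks, path):
    """Write an iterable of byte chunks to `path` and return the bytes written.

    `chunks` can be any iterable of bytes objects, so only one chunk
    needs to be in memory at a time.
    """
    total = 0
    with open(path, "wb") as fh:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                fh.write(chunk)
                total += len(chunk)
    return total


def count_rows_in_chunks(path, chunksize=100_000):
    """Read the CSV at `path` in pieces of `chunksize` rows and count the rows.

    Counting is only a placeholder; replace it with whatever filtering or
    aggregation you need per chunk.
    """
    rows = 0
    for chunk in pd.read_csv(path, sep=",", chunksize=chunksize):
        # Each `chunk` is an ordinary DataFrame with at most `chunksize` rows
        rows += len(chunk)
    return rows
```

With the loop above, the first helper could be fed from requests.get(url_with_sas, stream=True).iter_content(chunk_size=1024 * 1024), so that only one chunk of the download is held in memory at a time. Whether chunked processing is enough depends on what you do with the data: if you concatenate every chunk back into one DataFrame, peak memory will be similar to a plain read_csv.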