Getting "stream did not contain valid UTF-8" while trying to pull data into a pandas DataFrame in Azure Machine Learning


I have some data stored inside a storage account in Azure.

I have created a datastore linking this storage account to the Azure Machine Learning workspace. I have created two data assets in the Azure ML workspace:

  1. One for the individual parquet file containing the data
  2. Another for the folder that holds the file.

I want to pull this data into a pandas DataFrame in an Azure ML notebook. The folder will contain multiple files, and I want to build a single DataFrame from all of them, so I need something that points to the folder and loads every file in it.

When I pull in the data for the individual file, I am able to populate the DataFrame without any issue. However, when I try to do the same for the entire folder, I get errors.

This is the code I am using. It is generated by Azure itself when we go to the 'Consume' tab of the data asset.

import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get("folder_name", version="1")

path = {
  'folder': data_asset.path
}

tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df

When I run this code, I get this error:

UserErrorException: Error Code: ScriptExecution.StreamAccess.Unexpected Native Error: Dataflow visit error: ExecutionError(StreamError(Unknown("stream did not contain valid UTF-8", Some(Error { kind: InvalidData, message: "stream did not contain valid UTF-8" })))) VisitError(ExecutionError(StreamError(Unknown("stream did not contain valid UTF-8", Some(Error { kind: InvalidData, message: "stream did not contain valid UTF-8" })))))

=> Failed with execution error: error in streaming from input data sources ExecutionError(StreamError(Unknown("stream did not contain valid UTF-8", Some(Error { kind: InvalidData, message: "stream did not contain valid UTF-8" })))) Error Message: Got unexpected error: stream did not contain valid UTF-8. Error { kind: InvalidData, message: "stream did not contain valid UTF-8" }|

The data contains text in several scripts, such as Chinese, Japanese and Hindi, but that does not cause any issue when I pull in the data from the single file.


Solution

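  • You are using the wrong function to load Parquet files.

    Use from_parquet_files instead of from_delimited_files. Parquet is a binary format, so from_delimited_files fails when it tries to read the file contents as UTF-8 text.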

    The Consume tab falls back to the default code for CSV (delimited) files when the file format is not detected properly. It also shows a warning when it does this.

    When you do this for an individual file, it detects the file format and gives you code for reading a Parquet file, such as pd.read_parquet(data_asset.path).

    But when you create the asset as a folder and it fails to detect the format of the files inside, it gives you the default code for reading CSV files.

    So, use the code below to read the Parquet files in the folder.

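    import mltable
    from azure.ai.ml import MLClient
    from azure.identity import DefaultAzureCredential
    
    # Connect to the workspace described by the local config file
    ml_client = MLClient.from_config(credential=DefaultAzureCredential())
    # "folder_data_par" is the folder data asset used here; replace it with your own asset name
    data_asset = ml_client.data.get("folder_data_par", version="1")
    
    path = {
      'folder': data_asset.path
    }
    
    # Read the files in the folder as Parquet instead of delimited text
    tbl = mltable.from_parquet_files(paths=[path])
    df = tbl.to_pandas_dataframe()
    df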
    

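If you are not sure what a folder asset actually contains, a quick check before choosing the loader can help. The sketch below is not part of the Azure SDK: looks_like_parquet and is_valid_utf8 are hypothetical helper names, and the two files it writes are stand-in bytes for illustration, not real data. It relies on the fact that an unencrypted Parquet file starts and ends with the four bytes PAR1, and it shows that such binary content is not valid UTF-8 while multilingual CSV text is.

```python
import tempfile
from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # unencrypted Parquet files begin and end with these bytes


def looks_like_parquet(path):
    """Hypothetical helper: True if the file starts and ends with the Parquet magic bytes."""
    path = Path(path)
    if path.stat().st_size < 2 * len(PARQUET_MAGIC):
        return False
    with path.open("rb") as fh:
        head = fh.read(len(PARQUET_MAGIC))
        fh.seek(-len(PARQUET_MAGIC), 2)  # 2 means "relative to the end of the file"
        tail = fh.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def is_valid_utf8(data):
    """Hypothetical helper: True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


results = {}
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Stand-in for a Parquet file: magic bytes around arbitrary binary content
    (folder / "data.parquet").write_bytes(b"PAR1" + b"\xff\xfe\x00\x80" + b"PAR1")
    # Stand-in for a delimited file with Japanese and Hindi text, encoded as UTF-8
    (folder / "data.csv").write_bytes("city,value\n東京,1\nदिल्ली,2\n".encode("utf-8"))

    for file in sorted(folder.iterdir()):
        results[file.name] = (looks_like_parquet(file), is_valid_utf8(file.read_bytes()))

for name, (is_parquet, utf8_ok) in results.items():
    print(f"{name}: parquet={is_parquet}, valid_utf8={utf8_ok}")
```

Files that pass the magic-byte check belong with from_parquet_files. The non-Latin scripts in the CSV stand-in decode without trouble, which fits the observation in the question that the multilingual text is not what triggers the error.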