python azure azure-machine-learning-service azureml-python-sdk

AzureML: Dataset Profile fails when parquet file is empty


I have created a Tabular Dataset using the Azure ML Python API. The data in question is a set of parquet files (~10K files, each about 330 KB) residing in Azure Data Lake Gen 2 and spread across multiple partitions. When I trigger the "Generate Profile" operation for the dataset, it throws the following error on an empty parquet file, and profile generation then stops.

User program failed with ExecutionError: 
Error Code: ScriptExecution.StreamAccess.Validation
Validation Error Code: NotSupported
Validation Target: ParquetFile
Failed Step: 77866d0a-8243-4d3d-8bc6-599d466488dd
Error Message: ScriptExecutionException was caused by StreamAccessException.
  Failed to read Parquet file at: <my_blob_path>/20211217.parquet
    Current parquet file is not supported.
      Exception of type 'Thrift.Protocol.TProtocolException' was thrown.
| session_id=6be4db0b-bdc1-4dd6-b8a6-6e9466f7bc54

By empty parquet file, I mean that if I read the individual parquet file using pandas (pd.read_parquet), the result is an empty DataFrame (df.empty == True).

Any suggestions for avoiding this error would be appreciated.

Update: The issue has been fixed in the following versions:

  • azureml-dataprep : 3.0.1
  • azureml-core : 1.40.0
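
To check whether an environment already has these versions, something like the sketch below may help. It assumes that any release at or above the versions listed here contains the fix, which the update does not state explicitly. The helper names (`version_tuple`, `packages_below_fix`, `installed_versions`) are made up for illustration, and the version parsing is deliberately simplistic: it only looks at the leading numeric components, so a pre-release such as "3.0.1rc1" is treated the same as "3.0.1".

```python
import re
from importlib import metadata

# Versions named in the update above; treating them as minimums is an assumption.
MIN_FIXED = {"azureml-dataprep": (3, 0, 1), "azureml-core": (1, 40, 0)}


def version_tuple(version: str) -> tuple:
    """Turn '1.40.0.post1' into (1, 40, 0); pads short versions with zeros."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    parts = tuple(int(p) for p in match.group().split(".")) if match else ()
    return parts + (0,) * (3 - len(parts))


def packages_below_fix(installed: dict) -> list:
    """Return the packages that are missing or older than the listed versions."""
    return [
        name
        for name, minimum in MIN_FIXED.items()
        if name not in installed or version_tuple(installed[name]) < minimum
    ]


def installed_versions() -> dict:
    """Look up the installed versions of the two packages, skipping absent ones."""
    found = {}
    for name in MIN_FIXED:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return found
```

Calling `packages_below_fix(installed_versions())` in the environment that runs the profile job would list whichever of the two packages still needs upgrading.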

Solution

  • Thanks for reporting this. It is a bug in the handling of parquet files that have columns but an empty row set. It has already been fixed, and the fix will be included in the next release.
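
Until an upgrade is possible, one workaround that might be worth trying (it is not part of the answer above, and has not been verified against the profiling job) is to leave zero-row files out of the dataset. The sketch below separates files by row count. `split_empty_parquet` and `_pyarrow_row_count` are hypothetical helper names; the default reader assumes pyarrow is installed and that the paths can be opened directly, so files in Azure Data Lake would need to be downloaded or mounted first, or a different `row_count` callable supplied.

```python
from typing import Callable, Iterable, List, Tuple


def _pyarrow_row_count(path: str) -> int:
    """Read only the parquet footer metadata, not the data pages."""
    import pyarrow.parquet as pq  # imported lazily; assumes pyarrow is available

    return pq.read_metadata(path).num_rows


def split_empty_parquet(
    paths: Iterable[str],
    row_count: Callable[[str], int] = _pyarrow_row_count,
) -> Tuple[List[str], List[str]]:
    """Return (non_empty_paths, empty_paths), where empty means zero rows."""
    non_empty, empty = [], []
    for path in paths:
        if row_count(path) == 0:
            empty.append(path)
        else:
            non_empty.append(path)
    return non_empty, empty
```

The non-empty list could then be used when registering the dataset, while the empty list gives a record of which partitions were skipped.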