Tags: azure, azure-machine-learning-service, azureml-python-sdk

AzureML: TabularDataset.to_pandas_dataframe() hangs when parquet file is empty


I have created a TabularDataset using the Azure ML Python API. The data in question is a set of parquet files (about 10K files of roughly 330 KB each) residing in Azure Data Lake Gen 2, spread across multiple partitions. When I try to load the dataset with TabularDataset.to_pandas_dataframe(), the call hangs indefinitely if the dataset includes empty parquet files. If the dataset doesn't include those empty files, TabularDataset.to_pandas_dataframe() completes within a few minutes.

By empty parquet file, I mean that if I read the individual file with pandas (pd.read_parquet()), the result is an empty DataFrame (df.empty == True).
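To make that definition concrete, here is a minimal sketch of how the empty files could be identified among local copies of the parquet files. The function name `find_empty_parquet_files` and its `read_parquet` parameter are hypothetical names chosen for illustration; only `pd.read_parquet()` and `df.empty` come from the description above. It reads every file in full, so it may be slow for around 10K files.

```python
def find_empty_parquet_files(paths, read_parquet=None):
    """Return the paths whose parquet file loads as an empty DataFrame.

    `read_parquet` is a hypothetical hook for swapping in another reader;
    by default pandas.read_parquet is used (needs pyarrow or fastparquet).
    """
    if read_parquet is None:
        import pandas as pd  # imported lazily so the helper loads without pandas
        read_parquet = pd.read_parquet

    empty_paths = []
    for path in paths:
        df = read_parquet(path)
        # Same check as in the question: columns may exist, but there are no rows.
        if df.empty:
            empty_paths.append(path)
    return empty_paths
```

For example, `find_empty_parquet_files(glob.glob("data/**/*.parquet", recursive=True))` would list the empty files under a hypothetical local `data/` folder.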

I discovered the root cause while working on another issue mentioned [here][1].

My question is: how can I make TabularDataset.to_pandas_dataframe() work even when there are empty parquet files?

Update: The issue has been fixed in the following versions:

  • azureml-dataprep : 3.0.1
  • azureml-core : 1.40.0
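A rough way to check whether an environment already has these versions is sketched below, using only the standard library. It assumes that any later release also contains the fix, and the helper names (`parse_version`, `has_fix`, `check_installed`) are hypothetical. The version parsing is deliberately simple and ignores pre-release or post-release suffixes.

```python
import re
from importlib import metadata

# Versions listed in the update above, treated here as minimums (an assumption).
FIXED_VERSIONS = {"azureml-dataprep": "3.0.1", "azureml-core": "1.40.0"}


def parse_version(version):
    """Turn '1.40.0' into (1, 40, 0), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def has_fix(installed, minimum):
    """True if the installed version is at least the minimum version."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '1.40' and '1.40.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_installed(requirements=FIXED_VERSIONS):
    """Map each package to True/False, or None if it is not installed."""
    report = {}
    for package, minimum in requirements.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = None
            continue
        report[package] = has_fix(installed, minimum)
    return report


if __name__ == "__main__":
    for package, ok in check_installed().items():
        print(package, "not installed" if ok is None else ("ok" if ok else "needs upgrade"))
```

If a package is reported as needing an upgrade, upgrading it with pip to at least the version listed above should bring in the fix.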

Solution

  • Thanks for reporting this. It is a bug in the handling of parquet files that have columns but an empty row set. It has already been fixed, and the fix will be included in the next release.

    I could not reproduce the hang on multiple files, though, so it would help if you could provide more information on that.