Azure ML Dataset Versioning: What is Different if it Points to the Same Data?

Context

In AzureML, we are facing an error when running a pipeline. It fails on to_pandas_dataframe for one particular dataset, with the message "could not be read beyond end of stream". On its own, this looks like a problem with the parquet file being registered, perhaps special characters being misinterpreted.

However, when we explicitly load a previous "version" of this Dataset, which points to the exact same data location, it works as expected. The Azure documentation says that "when you load data from a dataset, the current data content referenced by the dataset is always loaded." This makes me think that a new version of the dataset with the same schema should behave, well, the same.

Questions

What makes a Dataset version different from another version when both point to the same location? Is it only the schema definition?
Based on these differences, is there a way to figure out why one version would be succeeding and another failing?

Attempts

The schemas of the two versions are identical. We can profile both in AzureML, and all the fields have the same profile information.
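
To narrow down which versions fail, it may help to load each registered version explicitly and record the outcome side by side. The following is only a minimal sketch, assuming the v1 Python SDK (azureml-core) and a workspace config.json; the dataset name "my_dataset", the version numbers, and the helper names try_materialize and check_versions are hypothetical and not part of the SDK.

```python
from typing import Any, Dict, Iterable, Optional


def try_materialize(dataset: Any) -> Optional[str]:
    """Return None if the dataset loads into pandas, otherwise the error text."""
    try:
        dataset.to_pandas_dataframe()
        return None
    except Exception as exc:  # broad on purpose: we only want to record the failure
        return f"{type(exc).__name__}: {exc}"


def check_versions(
    workspace: Any, name: str, versions: Iterable[int]
) -> Dict[int, Optional[str]]:
    """Try to load every requested version of one registered dataset."""
    # Imported lazily so try_materialize can be used without the SDK installed.
    from azureml.core import Dataset

    results: Dict[int, Optional[str]] = {}
    for version in versions:
        dataset = Dataset.get_by_name(workspace, name=name, version=version)
        results[version] = try_materialize(dataset)
    return results


# Example usage (assumes azureml-core is installed and config.json is present):
# from azureml.core import Workspace
# ws = Workspace.from_config()
# print(check_versions(ws, "my_dataset", [1, 2]))
```

A result of None for one version and an error string for another would confirm that the failure follows the version rather than the underlying files, which is useful detail to attach to a support ticket.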

Solution

As rightly suggested by @Anand Sowmithiran in the comments, this looks more like a bug in the SDK.

You can raise an Azure support ticket.