Context
In AzureML, we are facing an error when running a pipeline. It fails on to_pandas_dataframe because a particular dataset "could not be read beyond end of stream". On its own, this looks like a problem with the parquet file that is being registered, perhaps special characters being misinterpreted.
However, when we explicitly load a previous "version" of this Dataset--which points to the exact same location of data--it works as expected. In the documentation (here), Azure says that "when you load data from a dataset, the current data content referenced by the dataset is always loaded." This makes me think that a new version of the dataset with the same schema will be, well, the same.
Questions
What makes a Dataset version different from another version when both point to the same location? Is it only the schema definition?
Based on these differences, is there a way to figure out why one version would be succeeding and another failing?
Attempts
As rightly suggested by @Anand Sowmithiran in the comment section, this looks more like a bug in the SDK.
You can raise an Azure support ticket.
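When reporting this, it may help to record exactly which versions load and which fail, so the report is reproducible. The following is only a sketch, assuming azureml-core (SDK v1) is installed, a workspace config.json is available locally, and the dataset is a TabularDataset. The helper names probe_versions and azureml_loader are hypothetical and not part of the SDK.

```python
from typing import Any, Callable, Dict, Iterable


def probe_versions(load: Callable[[int], Any], versions: Iterable[int]) -> Dict[int, str]:
    """Try to load each version and record "ok" or the error that was raised."""
    results: Dict[int, str] = {}
    for version in versions:
        try:
            load(version)
            results[version] = "ok"
        except Exception as exc:  # broad on purpose: any failure is worth recording
            results[version] = f"{type(exc).__name__}: {exc}"
    return results


def azureml_loader(dataset_name: str) -> Callable[[int], Any]:
    """Build a loader for a registered TabularDataset (assumes azureml-core)."""
    # Imported lazily so probe_versions can be used without the SDK installed.
    from azureml.core import Dataset, Workspace

    ws = Workspace.from_config()  # assumes a local config.json for the workspace

    def load(version: int) -> Any:
        ds = Dataset.get_by_name(ws, name=dataset_name, version=version)
        # Same call that fails in the pipeline.
        return ds.to_pandas_dataframe()

    return load
```

For example, probe_versions(azureml_loader("my_dataset"), [1, 2]) would return one entry per version, where "my_dataset" is a placeholder for the registered name. Attaching that output to the support ticket shows that the failure depends on the version rather than on the underlying files.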