python, pyspark, petastorm

No fields matching the criteria 'None' were found in the dataset


I'm trying to load a Spark dataframe via petastorm 0.12, following the tutorial given in the petastorm-spark-converter-tensorflow notebook. Essentially my code is the following; the error described in the title is raised in the with statement. (It doesn't happen when merely creating the TFDatasetContextManager via train_context_manager = converter_train.make_tf_dataset(BATCH_SIZE), though.)

from petastorm import TransformSpec
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Cache directory for the intermediate files the converter materializes
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///dbfs/tmp/petastorm/cache")

converter_train = make_spark_converter(DF_TRAIN)

# The error is raised when entering this context manager
with converter_train.make_tf_dataset(BATCH_SIZE) as X_train:
    pass
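
For context, a quick sanity check (using the same DF_TRAIN as above) to confirm the input is non-empty and to inspect its column types:

# Minimal check: row count and schema of the dataframe fed to petastorm
print(DF_TRAIN.count())
DF_TRAIN.printSchema()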

The dataset definitely isn't empty. I also tried applying a TransformSpec that explicitly selects my target column:

with converter_train.make_tf_dataset(
    BATCH_SIZE,
    transform_spec=TransformSpec(selected_fields=[TRAIN_COL])
) as X_train:
    pass  # raises the same error

By the way, the same error occurs with converter_train.make_torch_dataloader, as in the sketch below.
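
A minimal sketch of the PyTorch path (same converter and BATCH_SIZE as above) that fails in the same way:

# The PyTorch loader built from the same converter raises the identical error
with converter_train.make_torch_dataloader(batch_size=BATCH_SIZE) as train_loader:
    pass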


Solution

  • The error message really is misleading: the actual cause was that the column I read is of type Array<Array<float>>. It turns out that make_batch_reader, which petastorm invokes under the hood, cannot handle this type (see the workaround sketch after this note):

    NOTE: only scalar columns or array type (of primitive type element) columns are currently supported.
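
A possible workaround (my own sketch, not part of the original answer): flatten the nested array into a plain Array<float> in Spark before building the converter, using pyspark.sql.functions.flatten (available since Spark 2.4). This discards the inner nesting, so any required shape has to be restored downstream, e.g. with tf.reshape in the TF input pipeline:

from pyspark.sql import functions as F

# Sketch: collapse Array<Array<float>> into Array<float> so petastorm's
# make_batch_reader sees an array with a primitive element type
DF_TRAIN_FLAT = DF_TRAIN.withColumn(TRAIN_COL, F.flatten(F.col(TRAIN_COL)))

converter_train = make_spark_converter(DF_TRAIN_FLAT)
with converter_train.make_tf_dataset(BATCH_SIZE) as X_train:
    pass  # no "No fields matching the criteria 'None'" error anymore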