I'm trying to load a Spark DataFrame via petastorm 0.12, following the tutorial given in the petastorm-spark-converter-tensorflow notebook. Essentially my code is the following; the error described in the title is raised in the with statement. (It doesn't happen when creating a TFDatasetContextManager directly via train_context_manager = converter_train.make_tf_dataset(BATCH_SIZE), though.)
from petastorm import TransformSpec
from petastorm.spark import SparkDatasetConverter, make_spark_converter

spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, "file:///dbfs/tmp/petastorm/cache")

converter_train = make_spark_converter(DF_TRAIN)

with converter_train.make_tf_dataset(BATCH_SIZE) as X_train:
    pass
The dataset definitely isn't empty. I also tried applying a TransformSpec that explicitly selects my target column:

with converter_train.make_tf_dataset(
    BATCH_SIZE,
    transform_spec=TransformSpec(selected_fields=[TRAIN_COL])
) as X_train:
    pass
By the way, the same error occurs with converter_train.make_torch_dataloader.
The error message really is misleading: the actual cause was that the column I read was of type Array<Array<float>>. It turns out that make_batch_reader, which petastorm invokes under the hood, cannot handle this type; its documentation states:

NOTE: only scalar columns or array type (of primitive type element) columns are currently supported.
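
One possible workaround (a minimal sketch, not taken from the notebook): flatten the nested array on the Spark side before creating the converter, so the column becomes a plain Array<float> of primitives that make_batch_reader accepts. This assumes the offending column is TRAIN_COL, as above, and that every inner array has the same length:

from pyspark.sql import functions as F

# Confirm the problem: a nested element type such as
# array<array<float>> in the schema is what make_batch_reader rejects.
DF_TRAIN.printSchema()

# Flatten Array<Array<float>> into a one-dimensional Array<float>.
df_flat = DF_TRAIN.withColumn(TRAIN_COL, F.flatten(F.col(TRAIN_COL)))

converter_train = make_spark_converter(df_flat)
with converter_train.make_tf_dataset(BATCH_SIZE) as X_train:
    pass

If the model needs the original 2-D structure, it can be restored on the TensorFlow side with tf.reshape, provided the inner arrays share a fixed length.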