Tags: scala, apache-spark, amazon-s3, apache-spark-sql, parquet

Partition id getting implicitly cast while reading from S3 in Spark/Scala


I have source data in S3, and my Spark/Scala application reads this data and writes it out as Parquet files after partitioning it on a new column, partition_id. The value of partition_id is derived by taking the first two characters of another id column that holds an alphanumeric string value. For example:

id = 2dedfdg34h, partition_id = 2d
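
For reference, the derivation of partition_id looks roughly like this (a minimal sketch: `spark` is the existing SparkSession, only the id and partition_id column names come from the job, and the source path/format are placeholders):

import org.apache.spark.sql.functions.{col, substring}

// Derive partition_id from the first two characters of the alphanumeric id column.
// The source path and format below are placeholders for illustration only.
val source = spark.read.json("s3://source-bucket/raw/")
val dataframe = source.withColumn("partition_id", substring(col("id"), 1, 2))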

After writing the data to S3, a separate partition folder is created for each partition and everything looks good. For example:

PRE partition_id=2d/
PRE partition_id=01/
PRE partition_id=0e/
PRE partition_id=fg/
PRE partition_id=5f/
PRE partition_id=jk/
PRE partition_id=06/
PRE partition_id=07/

But when I read these S3 files back into a DataFrame, values like 1d and 2d are converted to 1.0 and 2.0.

Spark version: 2.4.0

Please suggest how to avoid this implicit conversion.

The commands used to write and read the partitioned data to/from S3:

dataframe.write.partitionBy("partition_id").option("compression", "gzip").parquet(<path>)
spark.read.parquet(<path>)

Solution

  • The issue here is that Spark erroneously infers that the type of the partition column is a number. This is due to some of the values actually being numbers (Spark will not look through all of them).

    What you can do to avoid this is simply turn off the automatic type inference for partition columns when reading the data. This gives you back the original string values, as wanted. It can be done as follows:

    spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
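
    For example (a minimal sketch; the read path is a placeholder for the output location written above):

    // Disable partition-column type inference before reading the partitioned data;
    // partition_id then comes back as a string (e.g. "2d") instead of a double.
    spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
    val df = spark.read.parquet("s3://output-bucket/data/")  // placeholder path
    df.printSchema()  // partition_id is now reported as string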