Tags: dataframe, pyspark, avro

PySpark: unscaled value too large for precision


I am trying to read Avro files written by PySpark with different schemas; the difference is in the precision of a decimal column. Below is the folder structure of the Avro folders written by PySpark:

/mywork/avro_data/day1/part-*
/mywork/avro_data/day2/part-*

Below are their schemas:

day1 = spark.read.format('avro').load('/mywork/avro_data/day1')
day1.printSchema()
root
 |-- price: decimal(5,2) (nullable = True)

day2 = spark.read.format('avro').load('/mywork/avro_data/day2')
day2.printSchema()
root
 |-- price: decimal(20,2) (nullable = True)
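
For reference, something like the following sketch produces two folders with these schemas (the values are illustrative, not my actual data; the column name, precisions, and paths are the ones above). Note that day2 holds a value whose unscaled digits do not fit into decimal(5,2):

from decimal import Decimal
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DecimalType

spark = SparkSession.builder.getOrCreate()

# day1: narrow decimal(5,2) price column
day1_schema = StructType([StructField("price", DecimalType(5, 2), True)])
spark.createDataFrame([(Decimal("123.45"),)], day1_schema) \
    .write.format("avro").save("/mywork/avro_data/day1")

# day2: wider decimal(20,2) price column, holding a value that needs the extra digits
day2_schema = StructType([StructField("price", DecimalType(20, 2), True)])
spark.createDataFrame([(Decimal("123456789012345678.90"),)], day2_schema) \
    .write.format("avro").save("/mywork/avro_data/day2")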

While reading the whole dataframe (for both days)

>>> df = spark.read.format('avro').load('/mywork/avro_data/')

it gives the error below:

java.lang.IllegalArgumentException: unscaled value too large for precision

Why doesn't PySpark implicitly consider the higher-precision schema (i.e., stay backward compatible)?


Solution

  • Spark uses the first sample record to infer the schema. I think that, in your case, this sample record is of type decimal(5, 2), which causes this exception. Regarding your question:

    Why doesn't PySpark implicitly consider the higher schema?

    To achieve this, Spark would need to read the whole dataset twice: once to infer the schema and a second time for processing. If you went this way, even df.limit(1) would have to read every file first to infer the schema and only then read the first record.

    There is an avroSchema option to specify the schema explicitly, as below:

     val p = spark
       .read
       .format("avro")
       .option("avroSchema", schema.toString)
       .load(directory)
     p.show(false)
    

    But each Avro file inside .load(directory) must then match that schema, which is not the case here.
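
    For completeness, a PySpark equivalent of the Scala snippet above could look like the sketch below. The hand-written Avro JSON is an assumption (in particular the default record name topLevelRecord and encoding the column as a nullable bytes/decimal logical type), and, as just noted, every file under the directory would still have to be readable with this one schema:

        import json
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()

        # Reader schema for the wider decimal(20,2) column (hand-written, see note above)
        avro_schema = {
            "type": "record",
            "name": "topLevelRecord",
            "fields": [{
                "name": "price",
                "type": ["null", {"type": "bytes", "logicalType": "decimal",
                                  "precision": 20, "scale": 2}],
                "default": None,
            }],
        }

        df = (spark.read.format("avro")
              .option("avroSchema", json.dumps(avro_schema))
              .load("/mywork/avro_data/"))
        df.printSchema()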

    Alternative

    Read both dataframes separately and then union them, as in the sketch below.
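
    A minimal sketch of that alternative, reusing the paths from the question (the explicit cast of day1's column to decimal(20,2) is added here so both sides have the same type before the union):

        from pyspark.sql import SparkSession
        from pyspark.sql.functions import col

        spark = SparkSession.builder.getOrCreate()

        # Read each day's folder separately so each keeps its own correct schema
        day1 = spark.read.format("avro").load("/mywork/avro_data/day1")
        day2 = spark.read.format("avro").load("/mywork/avro_data/day2")

        # Widen the narrower column to decimal(20,2), then union by column name
        df = day1.withColumn("price", col("price").cast("decimal(20,2)")) \
                 .unionByName(day2)

        df.printSchema()   # price: decimal(20,2)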