Tags: apache-spark, parquet

How does schema inference work in spark.read.parquet?


I'm trying to read a Parquet file in Spark and I have a question.

How are column types determined when loading a Parquet file with spark.read.parquet? Which of the following happens?

  • 1. A fixed mapping from Parquet types, e.g. Parquet type INT32 -> Spark type IntegerType
  • 2. Inference from the actual stored values, e.g. values that fit in an integer -> Spark IntegerType

Is there a dictionary-like mapping as in 1, or is the type inferred from the actual stored values as in 2?


Solution

  • Spark parses the Parquet file's schema into its internal representation (i.e., a StructType); it does not infer types from the stored values. This is a bit hard to find in the Spark docs, but the mapping you are looking for is implemented in the code here:

    https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L197-L281
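To make the idea concrete, here is a hypothetical Python sketch of what that fixed mapping looks like for Parquet primitive types. The real converter is Scala and also consults logical-type annotations (e.g. a UTF8-annotated BINARY becomes StringType), so this table is a simplified, partial illustration based on documented Spark behavior, not the actual implementation:

```python
# Partial, illustrative version of the fixed primitive-type mapping
# that Spark's ParquetSchemaConverter applies. The real converter also
# looks at logical-type annotations, so some entries have exceptions.
PARQUET_TO_SPARK = {
    "BOOLEAN": "BooleanType",
    "INT32": "IntegerType",    # unless annotated, e.g. DATE -> DateType
    "INT64": "LongType",       # unless annotated, e.g. TIMESTAMP annotations
    "INT96": "TimestampType",  # legacy Impala/Hive timestamp encoding
    "FLOAT": "FloatType",
    "DOUBLE": "DoubleType",
    "BINARY": "BinaryType",    # UTF8-annotated BINARY -> StringType
}

def spark_type_for(parquet_primitive: str) -> str:
    """Look up the Spark SQL type for a Parquet primitive type."""
    return PARQUET_TO_SPARK[parquet_primitive]
```

The key point: the type comes from the Parquet schema stored in the file footer, so reading never scans the data values to decide column types.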