Tags: java, apache-spark, parquet

Parquet file not keeping non-nullability aspect of schema when read into Spark 3.3.0


I read data from a CSV file, and supply a hand-crafted schema:

new StructType(new StructField[] {
    new StructField("id", LongType, false, Metadata.empty()),
    new StructField("foo", IntegerType, false, Metadata.empty()),
    new StructField("bar", DateType, true, Metadata.empty()) });
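
The schema is applied on the CSV read roughly as follows (a sketch only; the input path, the header option and the schema variable name are placeholders, not details from my actual code):

// "input.csv" is a placeholder path; `schema` is the StructType built above
Dataset<Row> df = spark.read()
    .format("csv")
    .option("header", "true")   // assumption: the CSV has a header row
    .schema(schema)             // use the hand-crafted schema instead of inferring one
    .load("input.csv");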

Printing the schema shows:

root
 |-- id: long (nullable = false)
 |-- foo: integer (nullable = false)
 |-- bar: date (nullable = true)

And writing it to a parquet file using this code ...

df.write().format("parquet").save("data.parquet");

... generates this log message:

INFO : o.a.s.s.e.d.p.ParquetWriteSupport: Initialized Parquet WriteSupport with Catalyst schema:
{
  "type" : "struct",
  "fields" : [ {
    "name" : "id",
    "type" : "long",
    "nullable" : false,
    "metadata" : { }
  }, {
    "name" : "foo",
    "type" : "integer",
    "nullable" : false,
    "metadata" : { }
  }, {
    "name" : "bar",
    "type" : "date",
    "nullable" : true,
    "metadata" : { }
  } ]
}
and corresponding Parquet message type:
message spark_schema {
  required int64 id;
  required int32 foo;
  optional int32 bar (DATE);
}

All looks good there.

However, if I then read in that parquet file using this code:

Dataset<Row> read = spark.read().format("parquet").load("data.parquet");

... and print the schema, I get:

root
 |-- id: long (nullable = true)
 |-- foo: integer (nullable = true)
 |-- bar: date (nullable = true)

As can be seen above, all columns have become nullable - the non-nullability specified in the original schema has been lost.
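
The same loss can be confirmed programmatically; a small sketch using the read Dataset from above:

// Print each field's nullability flag from the schema of the re-read Dataset
for (StructField field : read.schema().fields()) {
    System.out.println(field.name() + " nullable = " + field.nullable());
}
// prints "nullable = true" for id, foo and bar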

Now, if we take a look at some of the debug output produced during the load, it shows that the nullability information was written to, and is read back from, the file's metadata (I've added newlines to make it more readable):

FileMetaData(
  version:1, 
  schema:[SchemaElement(name:spark_schema, num_children:4), 
  SchemaElement(type:INT64, repetition_type:REQUIRED, name:id), 
  SchemaElement(type:INT32, repetition_type:REQUIRED, name:foo), 
  SchemaElement(type:INT32, repetition_type:OPTIONAL, name:bar, converted_type:DATE, logicalType:<LogicalType DATE:DateType()>)], 
  num_rows:7, 
  row_groups:null, 
  key_value_metadata:
      [
        KeyValue(key:org.apache.spark.version, value:3.3.0), 
        KeyValue(
          key:org.apache.spark.sql.parquet.row.metadata, 
          value:{
            "type":"struct",
            "fields":
                [
                  {"name":"id","type":"long","nullable":false,"metadata":{}},
                  {"name":"foo","type":"integer","nullable":false,"metadata":{}},
                  {"name":"bar","type":"date","nullable":true,"metadata":{}}
                ]
          })
        ], 
  created_by:parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94))

The question, then, is: why (and where) is the non-nullability being lost? And how can I ensure that this nullability information is preserved when the parquet file is read back in?

(Note that in my real use case I can't just hand-apply the schema again; I need the schema to be carried in the parquet file and correctly reconstituted on read.)


Solution

  • This is documented behaviour. From https://spark.apache.org/docs/3.3.0/sql-data-sources-parquet.html:

    Parquet is a columnar format that is supported by many other data processing systems. Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data. When reading Parquet files, all columns are automatically converted to be nullable for compatibility reasons.
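
  • The debug output in the question shows that the original Catalyst schema, nullability included, is still carried in the Parquet footer under the org.apache.spark.sql.parquet.row.metadata key. If re-imposing that schema after the read is acceptable, one possible workaround is sketched below. This is not something the documentation prescribes: the parquet-mr footer calls, the helper method and the part-file path are my own assumptions, and the schema is re-applied with createDataFrame, which takes the supplied StructType as-is, nullability flags included.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.util.HadoopInputFile;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.types.DataType;
    import org.apache.spark.sql.types.StructType;

    // Reads Spark's Catalyst schema JSON back out of a Parquet part file's footer.
    // The footer key is the one visible in the FileMetaData debug output in the question.
    static StructType sparkSchemaFromFooter(Path partFile) throws IOException {
        try (ParquetFileReader reader =
                ParquetFileReader.open(HadoopInputFile.fromPath(partFile, new Configuration()))) {
            String json = reader.getFooter().getFileMetaData().getKeyValueMetaData()
                    .get("org.apache.spark.sql.parquet.row.metadata");
            return (StructType) DataType.fromJson(json);
        }
    }

    // "data.parquet" is a directory; the part-file name below is a placeholder.
    StructType originalSchema =
        sparkSchemaFromFooter(new Path("data.parquet/part-00000-<uuid>.snappy.parquet"));

    Dataset<Row> read = spark.read().format("parquet").load("data.parquet");

    // createDataFrame applies the supplied schema as-is, so id and foo are
    // reported as non-nullable again.
    Dataset<Row> restored = spark.createDataFrame(read.javaRDD(), originalSchema);
    restored.printSchema();

    Whether re-asserting the schema like this is appropriate depends on the data actually honouring the non-null constraint; Spark will not re-validate it when the flags are flipped back.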