apache-spark, null, parquet, apache-drill

Null values best practices in Parquet files


I'm trying to figure out the best practice for a string column that may contain null values.
In SQL databases null is a legitimate value, but from reading around I've found lots of issues and questions about null values in Parquet files.
If I want to process these Parquet files later with a broad set of tools such as Drill, Spark, etc., what is the best approach for storing missing values: as nulls or as empty strings?


Solution

  • This is less about Spark or any other tool and more about your business logic: does it treat null and the empty string "" as different values? Many applications treat them as separate logical entities, and in that case the nulls should stay nulls.

    If your application treats them the same, you can store the empty string "" as the safer option, which avoids NullPointerExceptions on that column later.

    AFAIK the other big data components (Drill, Spark, etc., as well as the Parquet file format itself) handle null values very well.
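
    To make the trade-off concrete, here is a minimal sketch in plain Python (no Spark or Parquet library involved). The helper name `normalize_nulls` and the flag `treat_null_as_empty` are made up for illustration; the point is only that collapsing nulls into "" is a one-way step, so it is safe only when your logic really treats the two the same.

    ```python
    from typing import List, Optional


    def normalize_nulls(values: List[Optional[str]],
                        treat_null_as_empty: bool) -> List[Optional[str]]:
        """Prepare a string column for writing (hypothetical helper).

        If the business logic does not distinguish null from "", replace
        every None with "". Otherwise return the values unchanged so the
        null survives in the file.
        """
        if treat_null_as_empty:
            return ["" if v is None else v for v in values]
        return list(values)


    raw = ["alice", None, "", "bob"]

    # Case 1: null and "" mean different things, so keep the null.
    kept = normalize_nulls(raw, treat_null_as_empty=False)

    # Case 2: null and "" mean the same thing, so collapse to "".
    merged = normalize_nulls(raw, treat_null_as_empty=True)

    # After merging, the original null can no longer be told apart
    # from the value that was already an empty string.
    null_count_kept = sum(v is None for v in kept)
    null_count_merged = sum(v is None for v in merged)
    empty_count_merged = merged.count("")
    ```

    In PySpark the same replacement can be done on a DataFrame with `df.fillna("", subset=[...])` before writing; the column name you pass in `subset` depends on your own schema.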