Tags: scala, apache-spark

Scala Spark read with partitions drop partitions


There is an HDFS directory:

/home/path/date=2022-12-02, where date=2022-12-02 is a partition.

A Parquet file has been written to this directory under the partition "date=2022-12-02".

To read the file together with its partition, I use:

   spark
        .read
        .option("basePath", "/home/path")
        .parquet("/home/path/date=2022-12-02")

The file is read successfully, with all partition fields.

But the partition folder ("date=2022-12-02") is dropped from the directory.

I can't grasp what the reason is, or how to avoid it.


Solution

  • There are two ways to keep the date as part of your table:

    1. Read the path like this: .parquet("/home/path/")

    2. Add a new column using the input_file_name() function, then manipulate the string until you get a date column (fairly easy: take the path segment containing date= — the parent directory of the file — split it on the equals sign, and take index 1).

    I don't think there is another way to do what you want directly.
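    The two approaches above can be sketched roughly as follows (a sketch, not tested against a cluster; the paths and session setup are assumptions from the question):

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("partition-read").getOrCreate()

    // Approach 1: read the base path, so Spark's partition discovery
    // turns the date=... directory into a `date` column automatically.
    val df1 = spark.read.parquet("/home/path/")

    // Approach 2: read the single partition directory, then recover the
    // date from the source-file path. input_file_name() returns the full
    // path of each row's file, e.g. /home/path/date=2022-12-02/part-....parquet
    val df2 = spark.read
      .parquet("/home/path/date=2022-12-02")
      .withColumn("path", input_file_name())
      // the partition directory is the second-to-last path segment
      .withColumn("date", element_at(split(col("path"), "/"), -2))
      // split "date=2022-12-02" on '=' and keep the value part
      .withColumn("date", split(col("date"), "=").getItem(1))
      .drop("path")
    ```

    With approach 1 you can still restrict to one partition with a filter like .filter(col("date") === "2022-12-02"), and Spark will prune the other directories.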