Tags: scala, apache-spark, apache-spark-sql, parquet

How to read the data of the last 3 days from a folder with parquet files?


I have a folder with many parquet files that have names as follows:

user_2018-03-15_checked_products.parquet
user_2018-03-15_unchecked_products.parquet
user_2018-03-14_checked_products.parquet
user_2018-03-14_unchecked_products.parquet
user_2018-03-13_checked_products.parquet
user_2018-03-13_unchecked_products.parquet
user_2018-03-12_checked_products.parquet
user_2018-03-12_unchecked_products.parquet

I read all files as follows:

val df = spark.read.parquet("path/to/folder")

The folder contains 100 GB of data and keeps growing, but I only need the data for the last 3 days. Currently, I read the whole folder and then apply a filter. Is it possible to use some kind of mask to select only the files whose names fall within the last 3 days, instead of reading the whole folder?


Solution

  • You can list all the file names and keep only those whose date falls within the last 3 days:

    import java.text.SimpleDateFormat
    import java.util.Date
    import org.joda.time.{Days, LocalDateTime} // Joda-Time must be on the classpath

    val listOfFiles: Seq[String] = ??? // read all the file names

    val dateFormat = new SimpleDateFormat("yyyy-MM-dd")
    val currentDate = dateFormat.parse(dateFormat.format(new Date())) // today at midnight

    val filteredFile = listOfFiles.filter { file =>
      val fileDate = dateFormat.parse(file.split("_")(1)) // date is the second "_"-separated token
      // whole days between the file's date and today
      val days = Days.daysBetween(new LocalDateTime(fileDate), new LocalDateTime(currentDate)).getDays
      days >= 0 && days <= 3 // keep files dated today or up to 3 days back
    }
    

    Now read the filtered files:

    spark.read.parquet(filteredFile: _*)
    

    If the list contains bare file names, prepend the folder path to each one first.
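
    The answer does not say how to obtain the file names. One possible way, sketched here as an assumption rather than the author's method, is the Hadoop FileSystem API that ships with Spark. The helper names listParquetFileNames and withFolder are hypothetical, and the sketch assumes the files sit directly inside the folder:

    ```scala
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path

    // Hypothetical helper: names (not full paths) of the .parquet files
    // located directly inside `folder`. Subfolders are not traversed.
    def listParquetFileNames(folder: String, hadoopConf: Configuration): Seq[String] = {
      val dir = new Path(folder)
      val fs = dir.getFileSystem(hadoopConf) // resolves the file system from the path's scheme
      fs.listStatus(dir).toSeq
        .filter(_.isFile)
        .map(_.getPath.getName)
        .filter(_.endsWith(".parquet"))
    }

    // Hypothetical helper: prepend the folder to each bare file name.
    def withFolder(folder: String, names: Seq[String]): Seq[String] =
      names.map(name => s"$folder/$name")
    ```

    With a running session, the list could be obtained as listParquetFileNames("path/to/folder", spark.sparkContext.hadoopConfiguration), and the filtered names passed on as spark.read.parquet(withFolder("path/to/folder", filteredFile): _*). Returning bare names keeps file.split("_")(1) safe even when the folder path itself contains an underscore.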

    Hope this helps!
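
    The question also asks about a mask. Spark resolves read paths through Hadoop's glob matching, which supports {a,b} alternation, so a pattern built from the wanted dates may avoid listing the folder yourself. The following is a minimal sketch under that assumption. The function name recentFilesGlob is hypothetical, and the user_ prefix and .parquet suffix are taken from the file names in the question:

    ```scala
    import java.time.LocalDate

    // Builds a glob covering `today` and the `daysBack` days before it, e.g.
    // path/to/folder/user_{2018-03-15,2018-03-14,...}_*.parquet
    def recentFilesGlob(folder: String, today: LocalDate, daysBack: Int): String = {
      // LocalDate.toString yields the ISO form yyyy-MM-dd used in the file names
      val dates = (0 to daysBack).map(d => today.minusDays(d.toLong).toString)
      val alternatives = dates.mkString("{", ",", "}")
      s"$folder/user_${alternatives}_*.parquet"
    }

    val glob = recentFilesGlob("path/to/folder", LocalDate.now(), 3)
    // val df = spark.read.parquet(glob) // requires an active SparkSession named `spark`
    ```

    With daysBack = 3 the pattern covers the same range as the filter above (0 to 3 days old). Note that the read fails if the pattern matches no file at all.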