Tags: scala, apache-spark, hive, hive-partitions

Hive SQL - between multiple partitions


I have a Hive table partitioned by year, month, and day:

CREATE TABLE t1 (
...
 )
PARTITIONED BY (
year INT,
month INT,
day INT
)
STORED AS PARQUET;

and I need to take the data from the last 7 days/partitions. The first step is to work out the date seven days back:

import java.time.LocalDate
import java.time.format.DateTimeFormatter

val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
val runDate = date
val runDay = LocalDate.parse(runDate.toString, formatter)
val runDayMinus7 = runDay.minusDays(7)

But then I face the troll under the bridge: is there any way to select data between those two dates across multiple partition columns? Otherwise, is there some way to avoid converting the three partition columns into a single YYYYMMDD partition?

Either Scala or Hive works for me.

Thanks in advance


Solution

  • I've created the following method to solve this. It takes two parameters: a date and a number of days.

      import java.time.LocalDate
      import org.apache.spark.sql.Column
      import org.apache.spark.sql.functions.{col, lit}

      // Builds a filter on the year/month/day partition columns matching
      // `date` and the preceding `days` days, by OR-ing one exact-day
      // predicate per day (recursing backwards one day at a time).
      def dayFilter(date: LocalDate, days: Int): Column = {
        days match {
          case _ if days < 0 => lit(false)
          case 0 =>
            col("year") === date.getYear and
              col("month") === date.getMonthValue and
              col("day") === date.getDayOfMonth
          case _ =>
            (col("year") === date.getYear and
              col("month") === date.getMonthValue and
              col("day") === date.getDayOfMonth) or dayFilter(date.minusDays(1), days - 1)
        }
      }
    

    Note: with java.time.LocalDate the month accessor is getMonthValue; with joda.time it is getMonthOfYear.
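The same idea also works on the pure-Hive side: since each day corresponds to one (year, month, day) partition, you can enumerate the last N days and OR the exact-day predicates together as a SQL string. A minimal sketch (the helper names `lastDays` and `partitionPredicate` are illustrative, not from the answer above):

```scala
import java.time.LocalDate

// The `date` itself plus the (days - 1) preceding days, newest first.
def lastDays(date: LocalDate, days: Int): Seq[LocalDate] =
  (0 until days).map(date.minusDays(_))

// One exact-day predicate per partition, OR-ed into a HiveQL WHERE clause.
def partitionPredicate(date: LocalDate, days: Int): String =
  lastDays(date, days)
    .map(d => s"(year=${d.getYear} AND month=${d.getMonthValue} AND day=${d.getDayOfMonth})")
    .mkString(" OR ")
```

This could then be used as, e.g., `spark.sql(s"SELECT * FROM t1 WHERE ${partitionPredicate(runDay, 7)}")`; because the predicate only touches the partition columns with equality comparisons, Hive can prune to exactly those partitions.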