Tags: apache-spark, pyspark, databricks, data-partitioning

Reading Spark-partitioned data from directories


My data is partitioned by year, month, and day in an S3 bucket. I have a requirement to read the last six months of data every day. I am using the code below to read the data, but it ends up selecting negative month values. Is there a way to read the correct data for the last six months?

from datetime import datetime

d = datetime.now().day
m = datetime.now().month
y = datetime.now().year

# month={m-6,m} interpolates to e.g. {-5,1} in January, so the glob
# asks for a negative month partition that does not exist
df2 = spark.read.format("parquet") \
  .option("header", "true").option("inferSchema", "true") \
  .load(f"rawdata/data/year={{2021,2022}}/month={{{m-6},{m}}}/*")

Solution

  • You can pass a list of paths (strings) as the .load() argument. First, build the list of (year, month) pairs for the last six months, counting back from today:

    from datetime import date
    from dateutil.relativedelta import relativedelta

    today = date.today()
    # (year, month) for the current month and the five preceding months
    y_m_list = [((today - relativedelta(months=i)).year,
                 (today - relativedelta(months=i)).month)
                for i in range(6)]

    y_m_list
    

    Output:

    [(2022, 1), (2021, 12), (2021, 11), (2021, 10), (2021, 9), (2021, 8)]
    

    Then build the .load() argument from it:

    .load([f"rawdata/data/year={yr}/month={mo}" for yr, mo in y_m_list])