Tags: apache-spark, pyspark, databricks, data-partitioning

Reading Spark-partitioned data from directories


My data is partitioned by year, month, and day in an S3 bucket. I have a requirement to read the last six months of data every day. I am using the code below to read the data, but it ends up selecting negative month values. Is there a way to read the correct data for the last six months?

from datetime import datetime

d = datetime.now().day
m = datetime.now().month
y = datetime.now().year

# month={m-6,m} interpolates to e.g. {-5,1} in January, so the glob
# asks for a negative month partition that does not exist
df2 = spark.read.format("parquet") \
  .option("header", "true").option("inferSchema", "true") \
  .load(f"rawdata/data/year={{2021,2022}}/month={{{m-6},{m}}}/*")

Solution

  • You can pass a list of paths (strings) as the .load() argument. First, build the list of (year, month) pairs for the last six months, counting back from today:

    from datetime import date
    from dateutil.relativedelta import relativedelta

    today = date.today()
    # (year, month) for the current month and the five preceding months
    y_m_list = [((today - relativedelta(months=i)).year,
                 (today - relativedelta(months=i)).month)
                for i in range(6)]

    y_m_list
    

    Output:

    [(2022, 1), (2021, 12), (2021, 11), (2021, 10), (2021, 9), (2021, 8)]
    

    Then build the .load() argument from it:

    .load([f"rawdata/data/year={yr}/month={mo}" for yr, mo in y_m_list])