I would like to create a function that takes a date as input and filters a monthly-level dataframe down to one year's worth of records, using the user-provided date as the start. My plan is to determine the date one year after the submitted date and then use both dates together to filter the original dataframe.
I'm working on the below, but the problem seems to be that PySpark wants to operate on dataframes, not a single object? Everything I can find on the topic is about converting or creating whole columns within a larger df. I think there's something I need to do with collect() to get a single value, but I'm not sure. At this point I'm assuming the user-submitted date already matches the format of the dataframe (monthly data with the format 2017-01-01). So, for example, if a user submitted "2017-01-01", the desired result would be a dataframe with records from 2017 only.
def one_year_data(df, st_date):
    st_date = to_date(st_date)
    end_date = add_months(st_date, 12)
    df1 = df.filter((df.date < end_date) & (df.date >= st_date))
    return df1
I get an error that "Column '2022-07-01' does not exist." When I print end_date I get Column<'add_months(to_date(2022-07-01), 12)'>.
st_date is a string, but the PySpark to_date function expects a Column. You can solve this by wrapping it, i.e. to_date(pyspark.sql.functions.lit(st_date)), or by computing end_date with Python's datetime module and doing your filtering with the resulting date objects.
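Here's a minimal sketch of both options. It assumes the dataframe's date column is named "date" and the input is an ISO-format string like "2022-07-01" (both assumptions based on your example):

from datetime import date

from pyspark.sql import functions as F

# Option 1: stay in Spark -- wrap the string in lit() so to_date() receives a Column
def one_year_data(df, st_date):
    start = F.to_date(F.lit(st_date))        # string literal -> DateType column
    end = F.add_months(start, 12)
    return df.filter((F.col("date") >= start) & (F.col("date") < end))

# Option 2: compute the end date with Python's datetime, then filter on date literals
def one_year_data_dt(df, st_date):
    start = date.fromisoformat(st_date)
    # adding 12 months is just the same month/day one year later;
    # safe here because monthly data always falls on day 1
    end = start.replace(year=start.year + 1)
    return df.filter((F.col("date") >= F.lit(start)) & (F.col("date") < F.lit(end)))

Either way the comparison happens between Columns (or a Column and a literal), which is why printing end_date in your version showed an unevaluated Column expression rather than a date value.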