pyspark

Problem creating a function that takes a date as an input and uses it to filter a dataframe


I would like to create a function that takes a date as an input and then filters a monthly-level dataframe down to only one year's worth of records, using the user-provided date as the start. My plan is to determine the date one year after the user-submitted date and then use the two dates together to filter the original dataframe.

I'm working on the function below, but the problem seems to be that PySpark wants to operate on dataframe columns, not a single value. Everything I can find on the topic is about converting or creating whole columns within a larger dataframe. I think there's something I need to do with collect() to get a single value, but I'm not sure. At this point I'm assuming the user-submitted date already matches the format of the dataframe (monthly data in the format 2017-01-01). So, for example, if a user submitted "2017-01-01", the desired result would be a dataframe with records from 2017 only.

from pyspark.sql.functions import to_date, add_months

def one_year_data(df, st_date):

    st_date = to_date(st_date)
    end_date = add_months(st_date, 12)

    df1 = df.filter((df.date < end_date) & (df.date >= st_date))

    return df1

I get the error "Column '2022-07-01' does not exist." When I print end_date, I get Column<'add_months(to_date(2022-07-01), 12)'>.


Solution

  • st_date is a string, but PySpark's to_date function expects a column. You can solve this either by wrapping the string in a literal column, i.e. to_date(pyspark.sql.functions.lit(st_date)), or by computing end_date with Python's datetime module and doing the filtering with plain datetime objects (see the sketch below).
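
A minimal sketch of both approaches, assuming the dataframe's date column is named date as in the question, that st_date arrives as a "YYYY-MM-DD" string, and (for the second option) that the python-dateutil package is available:

from datetime import datetime

from dateutil.relativedelta import relativedelta  # assumption: python-dateutil is installed
from pyspark.sql import functions as F


def one_year_data_lit(df, st_date):
    # Option 1: wrap the string in lit() so to_date()/add_months() receive a Column
    start = F.to_date(F.lit(st_date))
    end = F.add_months(start, 12)
    return df.filter((df.date >= start) & (df.date < end))


def one_year_data_py(df, st_date):
    # Option 2: compute the one-year window in plain Python, then compare the
    # date column against the resulting datetime.date objects
    start = datetime.strptime(st_date, "%Y-%m-%d").date()
    end = start + relativedelta(months=12)
    return df.filter((df.date >= start) & (df.date < end))

Note that neither version needs collect(): the comparison values are either literal column expressions or plain Python dates, both of which PySpark can compare against the date column directly.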