Tags: apache-spark, date, pyspark, apache-spark-sql, time-series

Create PySpark dataframe with timeseries column


I have an initial PySpark dataframe from which I would like to take the MIN and MAX of a date column, and then create a new PySpark dataframe with a daily time series between those two dates.

I will then join it with my initial dataframe to find the missing days (rows where the other columns of my initial DF are null).

I have tried many different ways to build the time series DF, but none of them seem to work in PySpark. Any suggestions?


Solution

  • The maximum value of a column can be extracted like this:

    df.agg(F.max('col_name')).head()[0]
    

    A date-range DF can be created like this (the sequence function is available in Spark 2.4+):

    df2 = spark.sql("SELECT explode(sequence(to_date('2000-01-01'), to_date('2000-02-02'), interval 1 day)) as date_col")
    

    And then join the two; the dates with no match in the original DF are the missing days (see the filter after the full example).


    Full example:

    from pyspark.sql import functions as F
    df1 = spark.createDataFrame(
        [(1, '2022-04-01'),
         (2, '2022-04-05')],
        ['id', 'df1_date'])
    
    # extract the MIN and MAX of the date column as Python values
    min_date = df1.agg(F.min('df1_date')).head()[0]
    max_date = df1.agg(F.max('df1_date')).head()[0]
    
    # build a daily date range between MIN and MAX using sequence + explode
    df2 = spark.sql(f"SELECT explode(sequence(to_date('{min_date}'), to_date('{max_date}'), interval 1 day)) as df2_date")
    
    # left join keeps every date in the range; missing days get null id/df1_date
    df3 = df2.join(df1, df1.df1_date == df2.df2_date, 'left')
    
    df3.show()
    # +----------+----+----------+
    # |  df2_date|  id|  df1_date|
    # +----------+----+----------+
    # |2022-04-01|   1|2022-04-01|
    # |2022-04-02|null|      null|
    # |2022-04-03|null|      null|
    # |2022-04-04|null|      null|
    # |2022-04-05|   2|2022-04-05|
    # +----------+----+----------+
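
    To get back to the question's original goal of finding the missing days, the joined result can be filtered on the null side. A minimal sketch continuing from df3 above (the df_missing name is just illustrative); it should show the three missing dates, 2022-04-02 through 2022-04-04:

    # keep only the dates that have no match in df1
    df_missing = df3.filter(F.col('id').isNull()).select('df2_date')
    df_missing.show()
    # +----------+
    # |  df2_date|
    # +----------+
    # |2022-04-02|
    # |2022-04-03|
    # |2022-04-04|
    # +----------+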
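
    As a side note, the MIN and MAX can be collected in a single aggregation instead of two separate jobs over df1 (a small variation on the extraction snippet above):

    # one pass over df1: head() returns a Row that unpacks into the two values
    min_date, max_date = df1.agg(F.min('df1_date'), F.max('df1_date')).head()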