Tags: apache-spark, date, pyspark, apache-spark-sql, datediff

Compute the difference in days between the max and the min of a date type column


My table, loaded in PySpark, has a "Date" column containing data like this:

Date      Open   High   Low    Close  Volume    Adj Close
1/3/2012  59.97  61.06  59.87  60.33  12668800  52.61923
1/4/2012  60.21  60.35  59.47  59.71  9593300   52.07848
1/5/2012  59.35  59.62  58.37  59.42  12768200  51.82554
1/6/2012  59.42  59.45  58.87  59     8069400   51.45922

How do I calculate the difference, in days, between the max and the min of the column (so in the example above, I need the difference in days between 1/6/2012 and 1/3/2012)?


Solution

  • Test data:

    from pyspark.sql import functions as F
    
    df = spark.createDataFrame(
        [('2012-01-03',), ('2013-02-03',), ('2011-11-29',)],
        ['Date']
    ).select(F.col('Date').cast('date'))
    df.show()
    # +----------+
    # |      Date|
    # +----------+
    # |2012-01-03|
    # |2013-02-03|
    # |2011-11-29|
    # +----------+
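    
    # If your Date column actually arrives as strings in the question's
    # 1/3/2012 form (an assumption about the input, not part of the answer),
    # parse it first with to_date and a matching pattern:
    raw = spark.createDataFrame([('1/3/2012',), ('1/6/2012',)], ['Date'])
    parsed = raw.withColumn('Date', F.to_date('Date', 'M/d/yyyy'))
    parsed.printSchema()
    # root
    #  |-- Date: date (nullable = true)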
    

    This will create a new dataframe containing the difference in days:

    df_diff = df.groupBy().agg(F.datediff(F.max('Date'), F.min('Date')).alias('diff'))
    
    df_diff.show()
    # +----+
    # |diff|
    # +----+
    # | 432|
    # +----+
    
    # If you need the difference in a variable:
    v = df_diff.head().diff
    print(v)
    # 432
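    
    # Sanity check (a sketch, not from the original answer): Spark returns
    # datetime.date objects for date columns, so you can also collect the
    # min and max and subtract them in Python.
    row = df.agg(F.min('Date').alias('min_d'), F.max('Date').alias('max_d')).first()
    print((row.max_d - row.min_d).days)
    # 432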
    

    And this will add a new column to your existing df:

    df = df.withColumn('diff', F.expr('datediff(max(Date) over(), min(Date) over())'))
    
    df.show()
    # +----------+----+
    # |      Date|diff|
    # +----------+----+
    # |2012-01-03| 432|
    # |2013-02-03| 432|
    # |2011-11-29| 432|
    # +----------+----+
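
    The empty over() makes each max/min a window over the entire DataFrame, so every row sees the same global values; note that such a global window pulls all rows into a single partition. If you prefer the DataFrame Window API over F.expr, an equivalent sketch (same result, same caveat) is:

    from pyspark.sql import Window
    
    w = Window.partitionBy()  # empty spec: one global window over all rows
    df = df.withColumn('diff', F.datediff(F.max('Date').over(w), F.min('Date').over(w)))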