Search code examples
pythonpysparkpyspark-pandas

Pandas API on Spark - Difference between two date columns


I want the difference between two date columns in the number of days.
In pandas dataframe difference in two "datetime64" type columns returns number of days
but in pyspark.pandas dataframe the difference is returned in the "int" type.

import pandas as pd
import pyspark.pandas as ps

data = {
    "d1": [
        "2019-05-18",
        "2019-06-21",
        "2019-05-08",
        "2019-05-22",
        "2019-11-20",
        "2019-05-29",
    ],
    "d2": [
        "2019-05-21",
        "2019-06-21",
        "2019-05-09",
        "2019-05-23",
        "2019-11-21",
        "2019-05-30",
    ],
}
df = pd.DataFrame(data)
df[["d1", "d2"]] = df[["d1", "d2"]].astype("datetime64")
pdf = ps.from_pandas(df)

df["diff"] = (df["d1"] - df["d2"]).dt.days
pdf["diff"] = pdf["d1"] - pdf["d2"]

Pandas difference Output: [-3, 0, -1, -1, -1, -1]
PySpark difference Output: [-259200, 0, -86400, -86400, -86400, -86400]

Expected Op: PySpark difference should return in days format and be accessible using dt.days

Tried:

  1. Converting int column to DateTime
pdf['diff'].astype('datetime64')

>>>['1969-12-29 00:00:00', '1970-01-01 00:00:00',
    '1969-12-31 00:00:00', '1969-12-31 00:00:00',
    '1969-12-31 00:00:00', '1969-12-31 00:00:00']
  1. The current sol that I am working with is
temp = pdf[["d1", "d2"]].to_pandas()
pdf["diff2"] = ps.Series((temp["d1"] - temp["d2"]).dt.days)

>>> [-3,  0, -1, -1, -1, -1]

This sol works, but for large datasets converting to pandas and doing operation on the two columns adds overhead and increases the runtime. So any concise way to get difference in two datetime columns in pyspark.pandas dataframe??


Solution

  • the diff is a timestamp diff, therefore, the result is in second : 259200 / 3600 / 24 = 3. Just add some math and you'll get your expected result.

    In pure Spark, you can also use datediff :

    from pyspark.sql import functions as F 
    
    
    df.select(F.datediff(df.d2, df.d1).alias('diff')).collect()
    # [Row(diff=32)]