Search code examples
dataframedatepyspark

Pyspark: how to fix 'could not parse datatype: interval' error


I'm trying to add a new column to a pyspark df by substracting the values of two existing columns.

I already had a date_of_birth column available, so I inserted a current_date column with the following code:

import datetime
currentdate = "14-12-2021"
day,month,year = currentdate.split('-')
today = datetime.date(int(year),int(month),int(day))

df= df.withColumn("current_date", lit(today))

Displaying my df confirms that it worked. Looks a little something like this:

id date_of_birth current_date
01 1995-01-01 2021-12-2021
02 1987-02-16 2021-12-2021

I inserted the age column by substracting the values of date_of_birth and current_date.

df = df.withColumn('age', (df['current_date'] - df['date_of_birth ']))

Cell runs without a problem.

Here's where I'm stuck:

Once I try to display my dataframe again in order to verify that everything went smoothly, the following error occurs:

'could not parse datatype: interval'

I used df.types() to check what's happening, and apparently my newly inserted age column is of interval type.

How can I fix this?

Is there a way to display the age in years (int) in this particular scenario?

PS: both the date_of_birth and current_date cols have a date type.


Solution

  • Solved it. Mike's comment helped tons. Thank you!

    Here's how I solved it:

    
    # insert new column current_date with dummy data (in this case, 1s)
    df = df.withColumn("current_date", lit(1))
    
    # update data with current_date() function
    df  = df .withColumn("current_date", f.current_date())
    
    # insert new column age with dummy data (in this case, 1s)
    df  = df .withColumn("age", lit(1))
    
    # update data with months_between() function, divide by 12 to obtain years.
    df  = df  .withColumn("age", f.months_between(df.current_date, df  .date_of_birth)/12)
    
    # round and cast as interger to get rid of decimals
    df  = df  .withColumn("age", f.round(df["age"]).cast('integer'))