Search code examples
pythondatedatetimeazure-databricksjulian-date

Converting 7 digit dates to normal calendar dates in Databricks python


I am generating data using TPC-DS.

I load the customers table to a dataframe. The c_first_sales_date_sk column has values such as 2449001, which makes me think they are Julian calendar dates of type yyyyDD.

So far I have tried:

from pyspark.sql.functions import to_date, from_unixtime
df_with_date = df.withColumn("c_first_sales_date", to_date(col("c_first_sales_date_sk"), format="yyyyDDD"))
display(df_with_date)

Applying this, it will convert 2449001 to 2449-01-01, which is wrong. The online convert at http://www.longpelaexpertise.com/toolsJulian.php converts the same date to 01-Jan-2024.

What am I doing wrong? How do I convert this column properly?


Solution

    • The tool that you are following has specifically mentioned that the tool converts Julian date (yyddd) to calendar date.
    • The 7-digit Julian date (yyyyddd) would be calculated as the number of days past (ddd) in that year (yyyy).
    • If you are sure that the number is Julian date, then the code that you are using would give appropriate results. The following are some samples that I have taken to convert 7-digit Julian dates to calender dates.
    data = [[1,'2449001'],[2,'2020111'],[3,'2010364']]
    cols = ['id','jd']
    df = spark.createDataFrame(data=data, schema=cols)
    #df.show()
    
    from pyspark.sql.functions import to_date, from_unixtime,col
    df_with_date = df.withColumn("ad", to_date(col("jd"), format="yyyyDDD"))
    display(df_with_date)
    

    enter image description here

    • However, as indicated by @FObersteiner, the numbers might indicate something else other than the Julian date (7 digit numbers).