
extract day from the first week of year


I am trying to extract, in PySpark, the date of the Sunday of every given week in a year. Week and year are in the format yyyyww. This works for every week except the first week, where I get a null value. Here is the sample code and result.

columns = ['id', 'week_year']
vals = [
     (1, 201952),
     (2, 202001),
     (3, 202002),
     (4, 201901),
     (5, 201902)
]
df = spark.createDataFrame(vals, columns)
+---+---------+
| id|week_year|
+---+---------+
|  1|   201952|
|  2|   202001|
|  3|   202002|
|  4|   201901|
|  5|   201902|
+---+---------+
from pyspark.sql.functions import concat, lit, to_timestamp

df = df.withColumn("day", to_timestamp(concat(df.week_year, lit("-Sunday")), 'yyyyww-E'))

As a result I get:

+---+---------+-------------------+
| id|week_year|                day|
+---+---------+-------------------+
|  1|   201952|2019-12-22 00:00:00|
|  2|   202001|               null|
|  3|   202002|2020-01-05 00:00:00|
|  4|   201901|               null|
|  5|   201902|2019-01-06 00:00:00|
+---+---------+-------------------+

Do you have any idea why it does not work for the first week? It also seems strange to me that January 5th and January 6th fall in the second week, not the first.


Solution

  • If you look at the calendar for 2020, the year starts on a Wednesday, which falls in the middle of week 1, so that first week does not contain a Sunday. The same goes for 2019. That is why 2020-01-05 ends up in the second week.

    Hope this helps!
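    The calendar arithmetic above can be verified with plain Python's `datetime`, without a Spark session. The sketch below assumes the same convention as Java's `SimpleDateFormat` defaults (week 1 is the week containing January 1, weeks run Sunday through Saturday); `sunday_of_week` is a hypothetical helper written for this illustration, not a Spark or standard-library function:

    ```python
    from datetime import date, timedelta

    def sunday_of_week(week_year: int) -> date:
        """Sunday of a yyyyww week, assuming week 1 contains Jan 1
        and weeks start on Sunday (SimpleDateFormat's defaults)."""
        year, week = divmod(week_year, 100)
        jan1 = date(year, 1, 1)
        # Sunday on or before Jan 1; weekday() is 0 for Monday .. 6 for Sunday
        first_sunday = jan1 - timedelta(days=(jan1.weekday() + 1) % 7)
        return first_sunday + timedelta(weeks=week - 1)

    print(sunday_of_week(202001))  # 2019-12-29 -- lands in the previous year
    print(sunday_of_week(202002))  # 2020-01-05
    print(sunday_of_week(201952))  # 2019-12-22
    ```

    Under this convention, the Sunday of week 1 of 2020 is 2019-12-29, i.e. it lies in the previous calendar year, which is consistent with Spark returning null when the pattern pins the year to 2020.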