
Getting the start and end of the week with PySpark


I'm using the query below to get the start and end of the week. Although dayofweek is part of pyspark.sql.functions, the SQL query doesn't seem to recognize the function in this format. I tried concatenating the function, but it still didn't work. My main issue is that one of the listed timestamps depends on the start and end of the week range to produce the values I'm looking for.

 somedataset.registerTempTable("mytable")
 spark.sql("SELECT DISTINCT max(col1) AS col1_edited, \
                               col2, col3, max(col4) AS col4_edited,\
                               min(MyTimestamp1) as MyTimestamp1_edited,\
                               date_sub(CAST(timestamp2 AS date), dayofweek-1) as week_start,\
                               date_sub(CAST(timestamp3 AS date), 7-dayofweek) as week_end \
                               FROM mytable \
                               GROUP BY col1_edited, col2, col3, col4, MyTimestamp1,\
                                        timestamp2,\
                                        timestamp3").registerTempTable("mineable")

I'm using PySpark 3.1.1+amzn.0; any help will be greatly appreciated. When the output comes back, it should look something like this:

(screenshot of the expected output, with week_start and week_end columns)

UPDATE: based on the comment below, this was the small fix (everything else remained the same):

date_sub(CAST(timestamp2 AS date), dayofweek(timestamp2)-1) as week_start,\
date_sub(CAST(timestamp3 AS date), 7-dayofweek(timestamp3)) as week_end,\

Solution

  • DAYOFWEEK should be called with a column as an argument:

    from datetime import date
    
    _data = [
        (date(2023, 10, 13), ),
        (date(2023, 10, 16), ),
        (date(2023, 10, 28), ),
    ]
    _schema = ['my_date']
    df = spark.createDataFrame(_data, _schema)
    df.createOrReplaceTempView('temp')
    q = """
    SELECT
      my_date,
      DATE_ADD(my_date, 1 - DAYOFWEEK(my_date)) AS WeekStart,
      DATE_ADD(my_date, 7 - DAYOFWEEK(my_date)) AS WeekEnd
    FROM
      temp
    """
    res = spark.sql(q)
    res.show()
    
    # +----------+----------+----------+
    # |   my_date| WeekStart|   WeekEnd|
    # +----------+----------+----------+
    # |2023-10-13|2023-10-08|2023-10-14|
    # |2023-10-16|2023-10-15|2023-10-21|
    # |2023-10-28|2023-10-22|2023-10-28|
    # +----------+----------+----------+
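
The arithmetic here relies on Spark's DAYOFWEEK convention, where Sunday is 1 and Saturday is 7. As a sanity check, the same week-bounds logic can be sketched in plain Python (a minimal illustration; `day_of_week` and `week_bounds` are hypothetical helper names, not Spark APIs):

```python
from datetime import date, timedelta

def day_of_week(d: date) -> int:
    # Mimic Spark's DAYOFWEEK: Sunday = 1 ... Saturday = 7.
    # Python's weekday() is Monday = 0 ... Sunday = 6, so shift accordingly.
    return (d.weekday() + 1) % 7 + 1

def week_bounds(d: date) -> tuple:
    # Mirrors DATE_ADD(d, 1 - DAYOFWEEK(d)) and DATE_ADD(d, 7 - DAYOFWEEK(d)):
    # step back to the preceding Sunday, forward to the following Saturday.
    dow = day_of_week(d)
    return d + timedelta(days=1 - dow), d + timedelta(days=7 - dow)

for d in (date(2023, 10, 13), date(2023, 10, 16), date(2023, 10, 28)):
    start, end = week_bounds(d)
    print(d, start, end)
```

This reproduces the Sunday-to-Saturday ranges shown in the table above.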