I'm trying to get the day of the week in int format from it saying Mon, Tues, Wed, etc.
I'm using this code right now:
from pyspark.sql.functions import date_format
df_ex1 = df.withColumn("day", date_format('Timestamp', 'E'))
Output:
+-------------------+----+----+----+-----+------------+-----------------+--------------+----+-----+---+
|Timestamp |Open|High|Low |Close|Volume_(BTC)|Volume_(Currency)|Weighted_Price|Year|Month|day|
+-------------------+----+----+----+-----+------------+-----------------+--------------+----+-----+---+
|2011-12-31 09:52:00|4.39|4.39|4.39|4.39 |0.45558086 |2.0 |4.39 |2011|12 |Sat|
|2011-12-31 09:53:00|NaN |NaN |NaN |NaN |NaN |NaN |NaN |2011|12 |Sat|
|2011-12-31 09:54:00|NaN |NaN |NaN |NaN |NaN |NaN |NaN |2011|12 |Sat|
|2011-12-31 09:55:00|NaN |NaN |NaN |NaN |NaN |NaN |NaN |2011|12 |Sat|
|2011-12-31 09:56:00|NaN |NaN |NaN |NaN |NaN |NaN |NaN |2011|12 |Sat|
+-------------------+----+----+----+-----+------------+-----------------+--------------+----+-----+---+
only showing top 5 rows
There are two Spark native ways: dayofweek
and weekday
(the latter - only SQL).
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('2022-08-01', 'Mon'),
('2022-08-02', 'Tue'),
('2022-08-03', 'Wed'),
('2022-08-04', 'Thu'),
('2022-08-05', 'Fri'),
('2022-08-06', 'Sat'),
('2022-08-07', 'Sun')],
['Timestamp', 'day'])
df = df.withColumn('day_number_1', F.dayofweek('Timestamp'))
df = df.withColumn('day_number_2', F.expr("weekday(Timestamp)"))
df.show()
# +----------+---+------------+------------+
# | Timestamp|day|day_number_1|day_number_2|
# +----------+---+------------+------------+
# |2022-08-01|Mon| 2| 0|
# |2022-08-02|Tue| 3| 1|
# |2022-08-03|Wed| 4| 2|
# |2022-08-04|Thu| 5| 3|
# |2022-08-05|Fri| 6| 4|
# |2022-08-06|Sat| 7| 5|
# |2022-08-07|Sun| 1| 6|
# +----------+---+------------+------------+