I have to migrate the SQL query below into a PySpark SQL query.
select case
    when LEN(travel_time) = 1 then
        concat('0', SUBSTRING(travel_time, PATINDEX('% ^0 %', travel_time), LEN(travel_time) + 1), ':00:00')
    else
        concat(SUBSTRING(travel_time, PATINDEX('% ^0 ', travel_time), LEN(travel_time) + 1), ':00:00')
end as travel_time
from sampleTable
The MS SQL statement above produces the required output; I need the equivalent in PySpark, including a substitute for the PATINDEX operation.
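There is no direct PATINDEX in Spark, but its behavior can be emulated with a regular-expression search (and Spark 3.5+ also ships a `regexp_instr` SQL function that plays a similar role). The sketch below uses plain Python's `re`; the `'%[^0]%'` reading is an assumption about what the garbled `'% ^0 %'` pattern above was meant to be, i.e. the position of the first character that is not `'0'`:

```python
import re

def patindex(pattern: str, s: str) -> int:
    """Emulate T-SQL PATINDEX: 1-based position of the first match, 0 if none."""
    m = re.search(pattern, s)
    return m.start() + 1 if m else 0

# Assuming the intended T-SQL pattern was '%[^0]%' (first character that is
# not '0'), the regex equivalent is r'[^0]':
print(patindex(r"[^0]", "07:45:00"))  # -> 2
print(patindex(r"[^0]", "1:15:00"))   # -> 1
```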
I have tried the approach below in PySpark:
from pyspark.sql.functions import concat, length, lpad, regexp_extract, when

dilip_df = spark.createDataFrame(
    [("1:15:00",), ("07:45:00",), ("0 2:23:00",), ("0 6:50:00",)],
    ['travel_time']
)

dilip_df.select(
    concat(
        when(
            length(dilip_df['travel_time']) == 1,
            concat(
                # left-pad the leading digits to two characters
                lpad(regexp_extract(dilip_df['travel_time'], r'\d+', 0), 2, '0'),
                # keep the trailing ':MM:SS' part
                regexp_extract(dilip_df['travel_time'], r':\d+:\d+', 0)
            )
        ).otherwise(
            regexp_extract(dilip_df['travel_time'], r'\d:?\d:\d+', 0)
        )
    ).alias('travel_time')
).show()
+-----------+
|travel_time|
+-----------+
| 15:00|
| 07:45|
| 23:00|
| 50:00|
+-----------+
The code above transforms the 'travel_time' column as follows:
- If the length of 'travel_time' is 1 (a single digit), it extracts the numeric part with regexp_extract, left-pads it to two digits with lpad, and concatenates the padded digits with the rest of the original string (also obtained via regexp_extract).
- If the length is not 1, it extracts the time part with a single regexp_extract.
The final result is a column aliased 'travel_time' containing the transformed values. As the output above shows, this attempt keeps only part of each timestamp, so it does not yet reproduce the MS SQL result.
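The truncated values in the output above can be reproduced outside Spark: the pattern `r'\d:?\d:\d+'` used in the `otherwise()` branch is unanchored, so it is free to start matching partway into the string and drop either the hour or the seconds. A quick check with Python's `re` (the pattern behaves the same in both regex flavors):

```python
import re

# the pattern used in the otherwise() branch above
pattern = r"\d:?\d:\d+"

for s in ["1:15:00", "07:45:00", "0 2:23:00", "0 6:50:00"]:
    print(s, "->", re.search(pattern, s).group(0))
# 1:15:00 -> 15:00
# 07:45:00 -> 07:45
# 0 2:23:00 -> 23:00
# 0 6:50:00 -> 50:00
```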
Using spark.sql:

dilip_df = spark.createDataFrame(
    [("1:15:00",), ("07:45:00",), ("02:23:00",), ("06:50:00",)],
    ['travel_time']
)
dilip_df.createOrReplaceTempView("dilip_table")
query = r"""
SELECT
    travel_time,
    CASE
        -- note: the first pattern also matches '0H:MM:SS' strings, so branch order matters
        WHEN travel_time RLIKE '^\\d{1,2}:\\d{2}:\\d{2}$' THEN 'HH:MM:SS'
        WHEN travel_time RLIKE '^0\\d{1}:\\d{2}:\\d{2}$' THEN '0H:MM:SS'
        WHEN travel_time RLIKE '^0\\d{1} \\d{1,2}:\\d{2}:\\d{2}$' THEN '0H HM:SS'
        ELSE NULL
    END AS format
FROM dilip_table
"""
result = spark.sql(query)
result.show()
The spark.sql version registers the DataFrame as a temporary view and classifies each 'travel_time' value with a CASE WHEN expression: each WHEN tests the string against one candidate layout and returns a format label ('HH:MM:SS', '0H:MM:SS', or '0H HM:SS'), falling through to NULL if nothing matches. Unlike the DataFrame version, this query only labels the format; it does not reformat the values.
Results:
+-----------+--------+
|travel_time| format|
+-----------+--------+
| 1:15:00|HH:MM:SS|
| 07:45:00|HH:MM:SS|
| 02:23:00|HH:MM:SS|
| 06:50:00|HH:MM:SS|
+-----------+--------+