Tags: python-3.x, pyspark, azure-databricks

How to Convert PATINDEX sql operation into pyspark sql engine


I have to migrate the MS SQL query below into a PySpark SQL query.

select case when LEN(travel_time) = 1 then
concat('0' , SUBSTRING(travel_time,PATINDEX('% ^0 %', travel_time ), LEN(travel_time)+1),':00:00')
else concat( SUBSTRING(travel_time,PATINDEX('% ^0 ', travel_time ), LEN(travel_time)+1),':00:00') end as travel_time from sampleTable


The MS SQL query above produces the desired output; I need an equivalent of the PATINDEX operation in PySpark.


Solution

  • I tried the following approach in PySpark:

    from pyspark.sql.functions import concat, lpad, regexp_extract
    dilip_df = spark.createDataFrame(
        [("1:15:00",), ("07:45:00",), ("0 2:23:00",), ("0 6:50:00",)],
        ['travel_time']
    )
    dilip_df.select(
        concat(
            # left-pad the hour digits (the run just before ':MM:SS') to two characters
            lpad(regexp_extract(dilip_df['travel_time'], r'(\d+):\d{2}:\d{2}$', 1), 2, '0'),
            # reattach the ':MM:SS' tail unchanged
            regexp_extract(dilip_df['travel_time'], r'(:\d{2}:\d{2})$', 1)
        ).alias('travel_time')
    ).show()
    
    +-----------+
    |travel_time|
    +-----------+
    |   01:15:00|
    |   07:45:00|
    |   02:23:00|
    |   06:50:00|
    +-----------+
    

    The above code transforms the 'travel_time' column as follows:

    Because both regexes are anchored at the end of the string, regexp_extract picks out the hour digits immediately before the ':MM:SS' tail, which also skips any stray '0 ' prefix (as in '0 2:23:00'). lpad then left-pads the extracted hour to two digits, and concat reattaches the ':MM:SS' tail. Since lpad is a no-op when the hour already has two digits, no when/otherwise branch on the string length is needed. The result is aliased back as 'travel_time'.
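The intended transformation (left-pad a single-digit hour to two digits, dropping any stray '0 ' prefix) can be prototyped outside Spark with plain Python's `re` module. This is a minimal sketch; `pad_hour` is an illustrative helper, not part of the original answer:

```python
import re

def pad_hour(travel_time):
    """Left-pad the hour of an H:MM:SS-style string to two digits.

    The regex is anchored at the end of the string, so a stray
    '0 ' prefix (as in '0 2:23:00') is skipped automatically.
    """
    m = re.search(r'(\d+)(:\d{2}:\d{2})$', travel_time)
    if m is None:
        return travel_time  # leave unrecognised values untouched
    hour, tail = m.groups()
    return hour.zfill(2) + tail

for t in ["1:15:00", "07:45:00", "0 2:23:00", "0 6:50:00"]:
    print(pad_hour(t))
```

The same pair of end-anchored regexes then carries over to `regexp_extract` plus `lpad` on a Spark column.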

    Using Spark.sql:

    from pyspark.sql.functions import expr
    dilip_df = spark.createDataFrame(
        [("1:15:00",), ("07:45:00",), ("02:23:00",), ("06:50:00",)],
        ['travel_time']
    )
    dilip_df.createOrReplaceTempView("dilip_table")
    query = """
        SELECT
            travel_time,
            CASE
                WHEN travel_time RLIKE '^[0-9]{1,2}:[0-9]{2}:[0-9]{2}$' THEN 'HH:MM:SS'
                WHEN travel_time RLIKE '^0[0-9]:[0-9]{2}:[0-9]{2}$' THEN '0H:MM:SS'
                WHEN travel_time RLIKE '^0[0-9] [0-9]{1,2}:[0-9]{2}:[0-9]{2}$' THEN '0H HM:SS'
                ELSE NULL
            END AS format
        FROM dilip_table
    """
    result = spark.sql(query)
    result.show()
    

    The spark.sql query does not transform the value; it classifies each 'travel_time' with a CASE expression that tests the string against anchored regular-expression patterns and labels the first format that matches (ELSE NULL when none does). One caveat: in Spark SQL, regexp_extract returns an empty string rather than NULL when there is no match, so a pattern test should use a boolean regex match instead of comparing an extracted value against NULL.

    Results:

    +-----------+--------+
    |travel_time|  format|
    +-----------+--------+
    |    1:15:00|HH:MM:SS|
    |   07:45:00|HH:MM:SS|
    |   02:23:00|HH:MM:SS|
    |   06:50:00|HH:MM:SS|
    +-----------+--------+
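The branch logic of the CASE expression can be sanity-checked in plain Python with the `re` module. This is a minimal sketch; `classify` is a hypothetical helper mirroring the CASE branches, not part of the original answer:

```python
import re

# Anchored patterns mirroring the CASE WHEN branches, with their labels.
PATTERNS = [
    (r'^[0-9]{1,2}:[0-9]{2}:[0-9]{2}$', 'HH:MM:SS'),
    (r'^0[0-9]:[0-9]{2}:[0-9]{2}$', '0H:MM:SS'),
    (r'^0[0-9] [0-9]{1,2}:[0-9]{2}:[0-9]{2}$', '0H HM:SS'),
]

def classify(travel_time):
    # The first matching branch wins, exactly like CASE WHEN.
    for pattern, label in PATTERNS:
        if re.match(pattern, travel_time):
            return label
    return None  # ELSE NULL

for t in ["1:15:00", "07:45:00", "0 2:23:00"]:
    print(t, classify(t))
```

Note that '0 2:23:00' (with the space after the leading zero) falls through every branch, which shows these patterns classify already-cleaned values but still need a preceding transformation step for the raw ones.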