Tags: python-3.x, pyspark, azure-databricks

How to Convert PATINDEX sql operation into pyspark sql engine


I have to migrate the MS SQL query below into a PySpark SQL query.

select case when LEN(travel_time) = 1 then
concat('0' , SUBSTRING(travel_time,PATINDEX('% ^0 %', travel_time ), LEN(travel_time)+1),':00:00')
else concat( SUBSTRING(travel_time,PATINDEX('% ^0 ', travel_time ), LEN(travel_time)+1),':00:00') end as travel_time from sampleTable


The MS SQL query above produces the desired output; I need an equivalent of the PATINDEX operation in PySpark.


Solution

  • I tried the following approach in PySpark:

    from pyspark.sql.functions import concat, lpad, regexp_extract
    dilip_df = spark.createDataFrame(
        [("1:15:00",), ("07:45:00",), ("0 2:23:00",), ("0 6:50:00",)],
        ['travel_time']
    )
    dilip_df.select(
        concat(
            # left-pad the hour digits (the run just before ':MM:SS') to two characters
            lpad(regexp_extract(dilip_df['travel_time'], r'(\d+):\d{2}:\d{2}$', 1), 2, '0'),
            # reattach the ':MM:SS' tail unchanged
            regexp_extract(dilip_df['travel_time'], r'(:\d{2}:\d{2})$', 1)
        ).alias('travel_time')
    ).show()
    
    +-----------+
    |travel_time|
    +-----------+
    |   01:15:00|
    |   07:45:00|
    |   02:23:00|
    |   06:50:00|
    +-----------+
    

    The above code transforms the 'travel_time' column as follows:

    Because both regexes are anchored at the end of the string, regexp_extract picks out the hour digits immediately before the ':MM:SS' tail, which also skips any stray '0 ' prefix (as in '0 2:23:00'). lpad then left-pads the extracted hour to two digits, and concat reattaches the ':MM:SS' tail. Since lpad is a no-op when the hour already has two digits, no when/otherwise branch on the string length is needed. The result is aliased back as 'travel_time'.
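The intended transformation (left-pad a single-digit hour to two digits, dropping any stray '0 ' prefix) can be prototyped outside Spark with plain Python's `re` module. This is a minimal sketch; `pad_hour` is an illustrative helper, not part of the original answer:

```python
import re

def pad_hour(travel_time):
    """Left-pad the hour of an H:MM:SS-style string to two digits.

    The regex is anchored at the end of the string, so a stray
    '0 ' prefix (as in '0 2:23:00') is skipped automatically.
    """
    m = re.search(r'(\d+)(:\d{2}:\d{2})$', travel_time)
    if m is None:
        return travel_time  # leave unrecognised values untouched
    hour, tail = m.groups()
    return hour.zfill(2) + tail

for t in ["1:15:00", "07:45:00", "0 2:23:00", "0 6:50:00"]:
    print(pad_hour(t))
```

The same pair of end-anchored regexes then carries over to `regexp_extract` plus `lpad` on a Spark column.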

    Using Spark.sql:

    from pyspark.sql.functions import expr
    dilip_df = spark.createDataFrame(
        [("1:15:00",), ("07:45:00",), ("02:23:00",), ("06:50:00",)],
        ['travel_time']
    )
    dilip_df.createOrReplaceTempView("dilip_table")
    query = """
        SELECT
            travel_time,
            CASE
                WHEN travel_time RLIKE '^[0-9]{1,2}:[0-9]{2}:[0-9]{2}$' THEN 'HH:MM:SS'
                WHEN travel_time RLIKE '^0[0-9]:[0-9]{2}:[0-9]{2}$' THEN '0H:MM:SS'
                WHEN travel_time RLIKE '^0[0-9] [0-9]{1,2}:[0-9]{2}:[0-9]{2}$' THEN '0H HM:SS'
                ELSE NULL
            END AS format
        FROM dilip_table
    """
    result = spark.sql(query)
    result.show()
    

    The spark.sql query does not transform the value; it classifies each 'travel_time' with a CASE expression that tests the string against anchored regular-expression patterns and labels the first format that matches (ELSE NULL when none does). One caveat: in Spark SQL, regexp_extract returns an empty string rather than NULL when there is no match, so a pattern test should use a boolean regex match instead of comparing an extracted value against NULL.

    Results:

    +-----------+--------+
    |travel_time|  format|
    +-----------+--------+
    |    1:15:00|HH:MM:SS|
    |   07:45:00|HH:MM:SS|
    |   02:23:00|HH:MM:SS|
    |   06:50:00|HH:MM:SS|
    +-----------+--------+
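The branch logic of the CASE expression can be sanity-checked in plain Python with the `re` module. This is a minimal sketch; `classify` is a hypothetical helper mirroring the CASE branches, not part of the original answer:

```python
import re

# Anchored patterns mirroring the CASE WHEN branches, with their labels.
PATTERNS = [
    (r'^[0-9]{1,2}:[0-9]{2}:[0-9]{2}$', 'HH:MM:SS'),
    (r'^0[0-9]:[0-9]{2}:[0-9]{2}$', '0H:MM:SS'),
    (r'^0[0-9] [0-9]{1,2}:[0-9]{2}:[0-9]{2}$', '0H HM:SS'),
]

def classify(travel_time):
    # The first matching branch wins, exactly like CASE WHEN.
    for pattern, label in PATTERNS:
        if re.match(pattern, travel_time):
            return label
    return None  # ELSE NULL

for t in ["1:15:00", "07:45:00", "0 2:23:00"]:
    print(t, classify(t))
```

Note that '0 2:23:00' (with the space after the leading zero) falls through every branch, which shows these patterns classify already-cleaned values but still need a preceding transformation step for the raw ones.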