
Extracting a specific part from a string column in Pyspark


In one of my projects, I need to transform a string column whose values look like the examples below:

"[44252-565333] result[0] - /out/ALL/abc12345_ID.xml.gz"

"[44252-565333] result[0] - /out/ALL/abc12_ID.xml.gz"

I only need the alphanumeric value after "ALL/" and before "_ID", so the first record should yield "abc12345" and the second "abc12".

In PySpark, I am using "substring" inside "withColumn" to take the 8 characters after the "ALL/" position, which gives me "abc12345" and "abc12_ID".

Then I use "regexp_replace" in another "withColumn": if the value matches "_ID$" (checked with "rlike"), I replace "_ID" with an empty string; otherwise I keep the value unchanged. This gives the expected results, "abc12345" and "abc12".

But is there a better solution for this?


Solution

  • Maybe like this? In one regexp_extract?

    F.regexp_extract('col_name', r'ALL\/([^\W_]+)', 1)
    

    Test:

    from pyspark.sql import functions as F
    df = spark.createDataFrame(
        [("[44252-565333] result[0] - /out/ALL/abc12345_ID.xml.gz",),
         ("[44252-565333] result[0] - /out/ALL/abc12_ID.xml.gz",)],
        ["col_name"])
    
    df = df.withColumn("col2", F.regexp_extract("col_name", r"ALL\/([^\W_]+)", 1))
    
    df.show(truncate=0)
    # +------------------------------------------------------+--------+
    # |col_name                                              |col2    |
    # +------------------------------------------------------+--------+
    # |[44252-565333] result[0] - /out/ALL/abc12345_ID.xml.gz|abc12345|
    # |[44252-565333] result[0] - /out/ALL/abc12_ID.xml.gz   |abc12   |
    # +------------------------------------------------------+--------+