
Extracting a specific part from a string column in Pyspark


In one of my projects, I need to transform a string column whose values look like the examples below:

"[44252-565333] result[0] - /out/ALL/abc12345_ID.xml.gz"

"[44252-565333] result[0] - /out/ALL/abc12_ID.xml.gz"

I only need the alphanumeric value after "ALL/" and before "_ID", so the first record should yield "abc12345" and the second "abc12".

In PySpark, I am using "substring" inside "withColumn" to take the 8 characters after the "ALL/" position, which gives me "abc12345" and "abc12_ID".

Then I use "regexp_replace" in another "withColumn": if the value matches "_ID$" (checked with "rlike"), I replace "_ID" with an empty string; otherwise I keep the value unchanged. This gives the expected results, "abc12345" and "abc12".

But is there a better solution for this?


Solution

  • Maybe like this? In one regexp_extract?

    F.regexp_extract('col_name', r'ALL\/([^\W_]+)', 1)
    

    Test:

    from pyspark.sql import functions as F
    df = spark.createDataFrame(
        [("[44252-565333] result[0] - /out/ALL/abc12345_ID.xml.gz",),
         ("[44252-565333] result[0] - /out/ALL/abc12_ID.xml.gz",)],
        ["col_name"])
    
    df = df.withColumn("col2", F.regexp_extract("col_name", r"ALL\/([^\W_]+)", 1))
    
    df.show(truncate=0)
    # +------------------------------------------------------+--------+
    # |col_name                                              |col2    |
    # +------------------------------------------------------+--------+
    # |[44252-565333] result[0] - /out/ALL/abc12345_ID.xml.gz|abc12345|
    # |[44252-565333] result[0] - /out/ALL/abc12_ID.xml.gz   |abc12   |
    # +------------------------------------------------------+--------+