Extracting a pattern from a string in PySpark using regex

I have a string, rasm_4_Becv0_0_1234_rasm_3exm, and I want to extract the digits after Becv, that is 0_0_1234, in PySpark.

Can anyone suggest a regular expression for this pattern? The digits vary from string to string.

Solution

This code should be able to extract the pattern that you are looking for. I added some dummy data in the form of:


strings
-------------------------------
rasm_4_Becv0_0_1230_rasm_3exm
rasm_4_Becv0_0_1231_rasm_3exm
rasm_4_Becv0_0_1232_rasm_3exm
rasm_4_Becv0_0_1233_rasm_3exm
rasm_4_Becv0_0_1234_rasm_3exm

from pyspark.sql import Row
from pyspark.sql.types import StructType, StringType, StructField
from pyspark.sql import functions as f

# build the DataFrame
data = []
for i in range(5):
    data.append(f"rasm_4_Becv0_0_123{i}_rasm_3exm")
df = spark.createDataFrame(data=[Row(x) for x in data], schema=StructType([StructField("strings", StringType(), True)]))

# extract the pattern: one digit, underscore, one digit, underscore, four digits
regex = r"(\d_\d_\d{4})"
group_idx = 1  # return the first capture group
df_new = df.withColumn("extracted_string", f.regexp_extract(f.col("strings"), regex, group_idx))
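
Since the question says the digits change, it may be worth anchoring the pattern on the literal Becv prefix and allowing any number of digits in each group. The sketch below checks such a variant locally with Python's re module before using it in Spark. The pattern Becv(\d+_\d+_\d+) and the helper extract_after_becv are illustrative assumptions, not part of the original answer. Spark evaluates regexes with the Java engine, but for a pattern this simple the behaviour should be the same.

```python
import re

# Hypothetical anchored variant: require the literal "Becv" prefix and
# allow one or more digits in each underscore-separated group.
ANCHORED = re.compile(r"Becv(\d+_\d+_\d+)")


def extract_after_becv(s: str) -> str:
    """Return the digit groups after 'Becv', or '' when there is no match
    (regexp_extract also returns an empty string in that case)."""
    m = ANCHORED.search(s)
    return m.group(1) if m else ""


# Same dummy data as in the DataFrame above
samples = [f"rasm_4_Becv0_0_123{i}_rasm_3exm" for i in range(5)]
results = [extract_after_becv(s) for s in samples]
print(results)
```

If the local check looks right, the same pattern string can be passed to f.regexp_extract with group_idx = 1.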

regexp_replace works as well, because in the replacement string the dollar sign refers to a capture group: $1 means group 1, $2 means group 2, and so on.

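# extract the pattern: the surrounding .* make the regex match the whole string,
# which is then replaced by the contents of capture group 1
regex = r".*(\d_\d_\d{4}).*"
replacement = "$1"
df_new = df.withColumn("extracted_string", f.regexp_replace(f.col("strings"), regex, replacement))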
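
One difference between the two approaches shows up on rows that do not match: regexp_extract returns an empty string, while regexp_replace leaves the input unchanged. The sketch below illustrates the replace behaviour with Python's re.sub as a local stand-in. Note that Python writes the group reference as \1 rather than $1, and the helper name replace_with_group is made up for this illustration.

```python
import re

# Same pattern as the regexp_replace version above
PATTERN = r".*(\d_\d_\d{4}).*"


def replace_with_group(s: str) -> str:
    """Replace the whole string with capture group 1.
    Python uses \\1 where Spark (Java regex) uses $1."""
    return re.sub(PATTERN, r"\1", s)


matched = replace_with_group("rasm_4_Becv0_0_1234_rasm_3exm")
unmatched = replace_with_group("rasm_4_other_3exm")
print(matched)    # the extracted digits
print(unmatched)  # the original string, because nothing matched
```

If unmatched rows should end up empty rather than unchanged, regexp_extract is probably the safer choice.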