
Extracting pattern in a string in pyspark using regex


I have a string, rasm_4_Becv0_0_1234_rasm_3exm, and I want to extract the digits after Becv, that is 0_0_1234, in PySpark.

Can anyone suggest a regular expression for this pattern? The digits vary from row to row.


Solution

  • This code should be able to extract the pattern that you are looking for. I added some dummy data in the form of:

    
    strings
    -------------------------------
    rasm_4_Becv0_0_1230_rasm_3exm
    rasm_4_Becv0_0_1231_rasm_3exm
    rasm_4_Becv0_0_1232_rasm_3exm
    rasm_4_Becv0_0_1233_rasm_3exm
    rasm_4_Becv0_0_1234_rasm_3exm
    
    
    from pyspark.sql import Row, SparkSession
    from pyspark.sql.types import StructType, StringType, StructField
    from pyspark.sql import functions as f
    
    # reuse the active session (e.g. in a notebook) or start a new one
    spark = SparkSession.builder.getOrCreate()
    
    # build the DataFrame with five sample strings
    data = [f"rasm_4_Becv0_0_123{i}_rasm_3exm" for i in range(5)]
    df = spark.createDataFrame(
        data=[Row(x) for x in data],
        schema=StructType([StructField("strings", StringType(), True)]),
    )
    
    # extract the pattern
    regex = r"(\d_\d_\d{4})"
    group_idx = 1
    df_new = df.withColumn("extracted_string", f.regexp_extract(f.col("strings"), regex, group_idx))
    

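    Using regexp_replace works as well. In the replacement string, the dollar sign refers to a capture group, so $1 means group 1, $2 means group 2, and so on. Because the pattern is wrapped in .* on both sides, it matches the whole string, and replacing that match with $1 leaves only the captured digits.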

    # extract the pattern
    regex = r".*(\d_\d_\d{4}).*"
    replacement = "$1"
    df_new = df.withColumn("extracted_string", f.regexp_replace(f.col("strings"), regex, replacement))
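
Both patterns above assume exactly one digit, one digit, then four digits. If the lengths can change, one possible variation is to anchor on the literal prefix Becv and capture any run of underscore-separated digit groups after it. The sketch below checks such a pattern locally with Python's built-in re module, without Spark. The names PATTERN and extract_after_becv are hypothetical and not part of the original answer, and the assumption that the value always directly follows Becv comes from the question's single example.

```python
import re

# Hypothetical pattern, not from the original answer.
# Assumption: the value directly follows the literal "Becv" and is made of
# digit groups separated by underscores (e.g. 0_0_1234).
PATTERN = r"Becv(\d+(?:_\d+)*)"


def extract_after_becv(s: str) -> str:
    """Return the digit groups after 'Becv', or '' when nothing matches."""
    m = re.search(PATTERN, s)
    return m.group(1) if m else ""


# Same dummy data as in the answer above.
samples = [f"rasm_4_Becv0_0_123{i}_rasm_3exm" for i in range(5)]
results = [extract_after_becv(s) for s in samples]
print(results)
```

The pattern uses only constructs that Java regular expressions also support, so the same string should be usable as the regex argument of f.regexp_extract with group index 1. The helper returns an empty string on a miss to mirror what regexp_extract does when the pattern does not match.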