Search code examples
stringapache-sparkpysparkextractazure-databricks

Extract specific string from a column and place them in a sequence


I have a dataframe like this:

df = [{'id': 1, 'id1': '859A;'},
      {'id': 2, 'id1': '209A/229A/509A;'},
      {'id': 3, 'id1': '(105A/111A/121A/131A/201A/205A/211A/221A/231A/509A/801A/805A/811A/821A)+TZ+-494;'},
      {'id': 4, 'id1': '111A/114A/121A/131A/201A/211A/221A/231A/651A+-(Y05/U17)/801A/804A/821A;'},
      {'id': 5, 'id1': '(651A/851A)+U17/861A;'},
     ]
df = spark.createDataFrame(df)

I want to split the "id1" column into two columns. One column needs to only extract strings which end with "A" and put them in a sequence with "/" between strings. The other column needs to extract the remaining strings and place them in a separate column as shown below.

Taking "id3", "id5" and "id2" as example, the desired output should be:

newcolumn1

(105A1,11A,121A,131A/201A,205A,211A,221A,231A/509A/801A,805A,811A,821A)   
(651A/851A,861A)                                                          
(209A,229A/509A)

newcolumn2

+TZ+-494;
+U17;
blank

All series starting with "1" and ending with "A" should be in one group, separated with comma. Every such series should be separated with "/".


Solution

  • Your best bet is to use regex. regexp_extract_all is not yet directly available in Python API, but you can use expr to reach it. You will also need a couple of consecutive aggregations.

    from pyspark.sql import functions as F
    
    cols = df.columns
    df = df.withColumn('_vals', F.explode(F.expr(r"regexp_extract_all(id1, '\\d+A', 0)")))
    df = (df
        .groupBy(*cols, F.substring('_vals', 1, 1)).agg(
            F.array_join(F.array_sort(F.collect_list('_vals')), ',').alias('_vals')
        ).groupBy(cols).agg(
            F.array_join(F.array_sort(F.collect_list('_vals')), '/').alias('newcolumn1')
    ).withColumn('newcolumn1', F.format_string('(%s)', 'newcolumn1')
    ).withColumn('newcolumn2', F.regexp_replace('id1', r'\d+A|/|\(|\)', ''))
    )
    
    df.show(truncate=0)
    # +---+--------------------------------------------------------------------------------+-----------------------------------------------------------------------+----------+
    # |id |id1                                                                             |newcolumn1                                                             |newcolumn2|
    # +---+--------------------------------------------------------------------------------+-----------------------------------------------------------------------+----------+
    # |3  |(105A/111A/121A/131A/201A/205A/211A/221A/231A/509A/801A/805A/811A/821A)+TZ+-494;|(105A,111A,121A,131A/201A,205A,211A,221A,231A/509A/801A,805A,811A,821A)|+TZ+-494; |
    # |5  |(651A/851A)+U17/861A;                                                           |(651A/851A,861A)                                                       |+U17;     |
    # |2  |209A/229A/509A;                                                                 |(209A,229A/509A)                                                       |;         |
    # |4  |111A/114A/121A/131A/201A/211A/221A/231A/651A+-(Y05/U17)/801A/804A/821A;         |(111A,114A,121A,131A/201A,211A,221A,231A/651A/801A,804A,821A)          |+-Y05U17; |
    # |1  |859A;                                                                           |(859A)                                                                 |;         |
    # +---+--------------------------------------------------------------------------------+-----------------------------------------------------------------------+----------+