I have a dataframe like this:
df = [{'id': 1, 'id1': '859A;'},
{'id': 2, 'id1': '209A/229A/509A;'},
{'id': 3, 'id1': '(105A/111A/121A/131A/201A/205A/211A/221A/231A/509A/801A/805A/811A/821A)+TZ+-494;'},
{'id': 4, 'id1': '111A/114A/121A/131A/201A/211A/221A/231A/651A+-(Y05/U17)/801A/804A/821A;'},
{'id': 5, 'id1': '(651A/851A)+U17/861A;'},
]
df = spark.createDataFrame(df)
I want to split the "id1" column into two columns. One column needs to only extract strings which end with "A" and put them in a sequence with "/" between strings. The other column needs to extract the remaining strings and place them in a separate column as shown below.
Taking "id3", "id5" and "id2" as example, the desired output should be:
newcolumn1
(105A1,11A,121A,131A/201A,205A,211A,221A,231A/509A/801A,805A,811A,821A)
(651A/851A,861A)
(209A,229A/509A)
newcolumn2
+TZ+-494;
+U17;
blank
All series starting with "1" and ending with "A" should be in one group, separated with comma. Every such series should be separated with "/".
Your best bet is to use regex. regexp_extract_all
is not yet directly available in Python API, but you can use expr
to reach it. You will also need a couple of consecutive aggregations.
from pyspark.sql import functions as F
cols = df.columns
df = df.withColumn('_vals', F.explode(F.expr(r"regexp_extract_all(id1, '\\d+A', 0)")))
df = (df
.groupBy(*cols, F.substring('_vals', 1, 1)).agg(
F.array_join(F.array_sort(F.collect_list('_vals')), ',').alias('_vals')
).groupBy(cols).agg(
F.array_join(F.array_sort(F.collect_list('_vals')), '/').alias('newcolumn1')
).withColumn('newcolumn1', F.format_string('(%s)', 'newcolumn1')
).withColumn('newcolumn2', F.regexp_replace('id1', r'\d+A|/|\(|\)', ''))
)
df.show(truncate=0)
# +---+--------------------------------------------------------------------------------+-----------------------------------------------------------------------+----------+
# |id |id1 |newcolumn1 |newcolumn2|
# +---+--------------------------------------------------------------------------------+-----------------------------------------------------------------------+----------+
# |3 |(105A/111A/121A/131A/201A/205A/211A/221A/231A/509A/801A/805A/811A/821A)+TZ+-494;|(105A,111A,121A,131A/201A,205A,211A,221A,231A/509A/801A,805A,811A,821A)|+TZ+-494; |
# |5 |(651A/851A)+U17/861A; |(651A/851A,861A) |+U17; |
# |2 |209A/229A/509A; |(209A,229A/509A) |; |
# |4 |111A/114A/121A/131A/201A/211A/221A/231A/651A+-(Y05/U17)/801A/804A/821A; |(111A,114A,121A,131A/201A,211A,221A,231A/651A/801A,804A,821A) |+-Y05U17; |
# |1 |859A; |(859A) |; |
# +---+--------------------------------------------------------------------------------+-----------------------------------------------------------------------+----------+