Suppose I have 3 lists and I have a column which is an array and I want to search in it to extract elements of those 3 lists..
Dataframe:
id desxription
1 ['this is bad', 'summerfull']
2 ['city tehran, country iran']
3 ['uA is a country', 'winternice']
5 ['this, is, summer']
6 ['this is winter','uAsal']
7 ['this is canada' ,'great']
8 ['this is toronto']
Lists:
L1 = ['summer', 'winter', 'fall']
L2 = ['iran', 'uA']
L3 = ['tehran', 'canada', 'toronto']
Now I want to make a new column with respect of each list (L1,L2,L3). Then search in each element of the lists in the description column. If the row has an element of the list, extract it in the column, otherwise NA:
Note: I want the exact match to be extracted. for example summerfull
should not be extracted by summer.
id desxription L1 L2 L3
1 ['this is bad', 'summerfull'] NA NA NA
2 ['city tehran, country iran'] NA iran tehran
3 ['uA is a country', 'winternice'] NA uA NA
5 ['this, is, summer'] summer NA NA
6 ['this is winter','uAsal'] winter NA NA
7 ['this is canada' ,'great'] NA NA canada
8 ['this is toronto'] NA NA toronto
# Create dictionary of key-vals
L = {'L1': L1, 'L2': L2, 'L3': L3}
for key, vals in L.items():
# regex pattern for extracting vals
pat = r'\\b(%s)\\b' % '|'.join(vals)
# extract matching occurrences
col = F.expr("regexp_extract_all(array_join(desxription, ' '), '%s')" % pat)
# Mask the rows with null when there are no matches
df = df.withColumn(key, F.when(F.size(col) == 0, None).otherwise(col))
>>> df.show()
+---+--------------------+--------+------+---------+
| id| desxription| L1| L2| L3|
+---+--------------------+--------+------+---------+
| 1|[this is bad, sum...| null| null| null|
| 2|[city tehran, cou...| null|[iran]| [tehran]|
| 3|[uA is a country,...| null| [uA]| null|
| 5| [this, is, summer]|[summer]| null| null|
| 6|[this is winter, ...|[winter]| null| null|
| 7|[this is canada, ...| null| null| [canada]|
| 8| [this is toronto]| null| null|[toronto]|
+---+--------------------+--------+------+---------+