Suppose I have 3 lists and I want to search in one of the columns of my dataframe among them.
Dataframe:
id desxription
1 'this is bad'
2 'city tehran country iran'
3 'uA is a country'
5 'this is summer'
6 'this is winter'
7 'this is canada'
8 'this is toronto'
Lists:
L1 = ['summer', 'winter', 'fall']
L2 = ['iran', 'uA']
L3 = ['tehran', 'canada', 'toronto']
Now I want to make a new column with respect of each list. If the row has an element of the list, extract it in the column, otherwise NA:
id desxription L1 L2 L3
1 'this is bad' NA NA NA
2 'city tehran country iran' NA iran tehran
3 'uA is a country' NA uA NA
5 'this is summer' summer NA NA
6 'this is winter' winter NA NA
7 'this is canada' NA NA canada
8 'this is toronto' NA NA toronto
You could use regexp_extract
. While forming the pattern, use \b
which will match word boundaries.
df = (df
.withColumn('L1', F.regexp_extract('desxription', r'(\b' + r'\b)|(\b'.join(L1) + r'\b)', 0))
.withColumn('L2', F.regexp_extract('desxription', r'(\b' + r'\b)|(\b'.join(L2) + r'\b)', 0))
.withColumn('L3', F.regexp_extract('desxription', r'(\b' + r'\b)|(\b'.join(L3) + r'\b)', 0))
.replace({'': 'NA'})
)