Search code examples
listdataframeapache-sparkpysparkcontains

How to determine if a column of a dataframe has an element of several different lists?


Suppose I have 3 lists and I want to search in one of the columns of my dataframe among them.

Dataframe:

id      desxription
1         'this is bad'
2         'city tehran country iran'
3         'uA is a country'
5         'this is summer'
6         'this is winter'
7         'this is canada'
8         'this is toronto'

Lists:

L1 = ['summer', 'winter', 'fall']
L2 = ['iran', 'uA']
L3 = ['tehran', 'canada', 'toronto']

Now I want to make a new column with respect of each list. If the row has an element of the list, extract it in the column, otherwise NA:

id      desxription                       L1          L2         L3
1         'this is bad'                      NA         NA          NA
2         'city tehran country iran'       NA         iran        tehran
3         'uA is a country'                  NA         uA          NA
5         'this is summer'                  summer      NA          NA
6         'this is winter'                  winter      NA          NA
7         'this is canada'                 NA         NA          canada
8         'this is toronto'               NA         NA          toronto

Solution

  • You could use regexp_extract. While forming the pattern, use \b which will match word boundaries.

    df = (df
        .withColumn('L1', F.regexp_extract('desxription', r'(\b' + r'\b)|(\b'.join(L1) + r'\b)', 0))
        .withColumn('L2', F.regexp_extract('desxription', r'(\b' + r'\b)|(\b'.join(L2) + r'\b)', 0))
        .withColumn('L3', F.regexp_extract('desxription', r'(\b' + r'\b)|(\b'.join(L3) + r'\b)', 0))
        .replace({'': 'NA'})
    )