Search code examples
pysparkcontainstext-extraction

how I can search in a list of items and extract some key words based on elements of 3 lists?


Suppose I have 3 lists and I have a column which is an array and I want to search in it to extract elements of those 3 lists..

Dataframe:

id               desxription                 
1         ['this is bad', 'summerfull']     
2         ['city tehran, country iran']       
3         ['uA is a country', 'winternice'] 
5         ['this, is, summer']              
6         ['this is winter','uAsal']       
7         ['this is canada' ,'great']       
8         ['this is toronto']            

Lists:

L1 = ['summer', 'winter', 'fall']
L2 = ['iran', 'uA']
L3 = ['tehran', 'canada', 'toronto']

Now I want to make a new column with respect of each list (L1,L2,L3). Then search in each element of the lists in the description column. If the row has an element of the list, extract it in the column, otherwise NA:

Note: I want the exact match to be extracted. for example summerfull should not be extracted by summer.

id               desxription                  L1          L2         L3
1         ['this is bad', 'summerfull']       NA         NA          NA
2         ['city tehran, country iran']       NA         iran        tehran
3         ['uA is a country', 'winternice']   NA         uA          NA
5         ['this, is, summer']               summer      NA          NA
6         ['this is winter','uAsal']         winter      NA          NA
7         ['this is canada' ,'great']        NA         NA          canada
8         ['this is toronto']                NA         NA          toronto

Solution

  • Annotated code

    # Create dictionary of key-vals
    L = {'L1': L1, 'L2': L2, 'L3': L3}
    
    for key, vals in L.items():
        # regex pattern for extracting vals
        pat = r'\\b(%s)\\b' % '|'.join(vals)
    
        # extract matching occurrences
        col = F.expr("regexp_extract_all(array_join(desxription, ' '), '%s')" % pat)
    
        # Mask the rows with null when there are no matches
        df = df.withColumn(key, F.when(F.size(col) == 0, None).otherwise(col))
    

    >>> df.show()
    
    +---+--------------------+--------+------+---------+
    | id|         desxription|      L1|    L2|       L3|
    +---+--------------------+--------+------+---------+
    |  1|[this is bad, sum...|    null|  null|     null|
    |  2|[city tehran, cou...|    null|[iran]| [tehran]|
    |  3|[uA is a country,...|    null|  [uA]|     null|
    |  5|  [this, is, summer]|[summer]|  null|     null|
    |  6|[this is winter, ...|[winter]|  null|     null|
    |  7|[this is canada, ...|    null|  null| [canada]|
    |  8|   [this is toronto]|    null|  null|[toronto]|
    +---+--------------------+--------+------+---------+