Search code examples
pythonregexpandasflattendummy-variable

Can I use regular expressions search or match on a Python Pandas column where each cell is a list of lists?


I have a somewhat large CSV file (>2,000 rows) I've read into Pandas and want to create a new indicator column based on whether or not a specific word appears in one of the data columns. I have been trying to use regex search, which may be overkill because the word will always appear split by spaces, but the cells of the DataFrame are lists of lists of strings. I've tried iterating using a double list comprehension, but there are errors, and I am also curious, as a Python newbie, if there is a generalized solution for an unspecified number of nested lists to flatten. Here is an example, where my end goal is a new column with 1 in the rows where the word 'saddle' appears anywhere in the cell from the selected column, and 0 if not.

My DataFrame looks like this

import pandas as pd
import numpy as np

cycling = pd.DataFrame(
    {
        'qty' : [1,0,2,1,1],
        'item' : ['frame','frame',np.nan,'order including a saddle and other things','brake'],
        'desc' : [np.nan,['bike','wheel'],['bike',['tire','tube']],['saddle',['seatpost','bag']],['bike','brakes']]
    }
)

Here is the DataFrame

I can search the item column to achieve my goal, using this code (efficiency and other suggestions VERY welcome!!):

cycling['saddle1'] = [int(bool(re.search(r"saddle",x))) for x in cycling['item'].replace(np.nan,'missing')]

My original dataset has missing values that I want to resolve to 0 in the indicator column; I don't care about them otherwise. The above code works great for a column with strings in each cell the fourth row is correctly identified, but I can't modify it to work when the cell contains a list, or a list of lists, like the desc column. I tried:

cycling['saddle2'] = [int(bool(re.search(r"saddle",x))) for y in cycling['desc'].replace(np.nan,'missing') for x in y]

but I get the following error

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-45-4c72cdaa87a4> in <module>()
----> 1 cycling['saddle2'] = [int(bool(re.search(r"saddle",x))) for y in cycling['desc'].replace(np.nan,'missing') for x in y]
      2 cycling.head()

1 frames
/usr/lib/python3.6/re.py in search(pattern, string, flags)
    180     """Scan through string looking for a match to the pattern, returning
    181     a match object, or None if no match was found."""
--> 182     return _compile(pattern, flags).search(string)
    183 
    184 def sub(pattern, repl, string, count=0, flags=0):

TypeError: expected string or bytes-like object

I think that error is that it doesn't like receiving non-strings for regex (maybe the unflattened lists?). Is there a way to search a column in Pandas for a specific word (possibly using regex), where some cells are lists of strings, some are lists of strings that also contain nested lists, and some cells are missing, to create an indicator column, with a 1 for anywhere it appears (whether it's nested or not), and 0 otherwise?


Solution

  • Instead of running a for loop (which is slow) you can use map. You can convert the list to str for calling the regex. Like this:-

    import pandas as pd
    import numpy as np
    import re
    
    cycling = pd.DataFrame(
        {
            'qty' : [1,0,2,1,1],
            'item' : ['frame','frame',np.nan,'order including a saddle and other things','brake'],
            'desc' : [np.nan,['bike','wheel'],['bike',['tire','tube']],['saddle',['seatpost','bag']],['bike','brakes']]
        }
    )
    cycling['saddle1'] = cycling['item'].replace(np.nan,'missing').map(lambda x :int(bool(re.search(r"saddle",x))))
    cycling['saddle2'] = cycling['desc'].replace(np.nan,'missing').map(lambda x :int(bool(re.search(r"saddle",str(x)))))
    
    cycling
    

    Hope this helps!!1