I have a somewhat large CSV file (>2,000 rows) I've read into Pandas and want to create a new indicator column based on whether or not a specific word appears in one of the data columns. I have been trying to use regex search, which may be overkill because the word will always appear split by spaces, but the cells of the DataFrame are lists of lists of strings. I've tried iterating using a double list comprehension, but there are errors, and I am also curious, as a Python newbie, if there is a generalized solution for an unspecified number of nested lists to flatten. Here is an example, where my end goal is a new column with 1
in the rows where the word 'saddle'
appears anywhere in the cell from the selected column, and 0
if not.
My DataFrame looks like this
import pandas as pd
import numpy as np
cycling = pd.DataFrame(
{
'qty' : [1,0,2,1,1],
'item' : ['frame','frame',np.nan,'order including a saddle and other things','brake'],
'desc' : [np.nan,['bike','wheel'],['bike',['tire','tube']],['saddle',['seatpost','bag']],['bike','brakes']]
}
)
I can search the item
column to achieve my goal, using this code (efficiency and other suggestions VERY welcome!!):
cycling['saddle1'] = [int(bool(re.search(r"saddle",x))) for x in cycling['item'].replace(np.nan,'missing')]
My original dataset has missing values that I want to resolve to 0
in the indicator column; I don't care about them otherwise. The above code works great for a column with strings in each cell the fourth row is correctly identified, but I can't modify it to work when the cell contains a list, or a list of lists, like the desc
column. I tried:
cycling['saddle2'] = [int(bool(re.search(r"saddle",x))) for y in cycling['desc'].replace(np.nan,'missing') for x in y]
but I get the following error
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-45-4c72cdaa87a4> in <module>()
----> 1 cycling['saddle2'] = [int(bool(re.search(r"saddle",x))) for y in cycling['desc'].replace(np.nan,'missing') for x in y]
2 cycling.head()
1 frames
/usr/lib/python3.6/re.py in search(pattern, string, flags)
180 """Scan through string looking for a match to the pattern, returning
181 a match object, or None if no match was found."""
--> 182 return _compile(pattern, flags).search(string)
183
184 def sub(pattern, repl, string, count=0, flags=0):
TypeError: expected string or bytes-like object
I think that error is that it doesn't like receiving non-strings for regex (maybe the unflattened lists?). Is there a way to search a column in Pandas for a specific word (possibly using regex), where some cells are lists of strings, some are lists of strings that also contain nested lists, and some cells are missing, to create an indicator column, with a 1
for anywhere it appears (whether it's nested or not), and 0
otherwise?
Instead of running a for loop (which is slow) you can use map
. You can convert the list to str
for calling the regex. Like this:-
import pandas as pd
import numpy as np
import re
cycling = pd.DataFrame(
{
'qty' : [1,0,2,1,1],
'item' : ['frame','frame',np.nan,'order including a saddle and other things','brake'],
'desc' : [np.nan,['bike','wheel'],['bike',['tire','tube']],['saddle',['seatpost','bag']],['bike','brakes']]
}
)
cycling['saddle1'] = cycling['item'].replace(np.nan,'missing').map(lambda x :int(bool(re.search(r"saddle",x))))
cycling['saddle2'] = cycling['desc'].replace(np.nan,'missing').map(lambda x :int(bool(re.search(r"saddle",str(x)))))
cycling
Hope this helps!!1