I am in the process of cleaning a dataframe, and I want to check whether any value in a column contains a word from a list of words. If it does, the value should be replaced by NA. For example,
My dataframe looks like this:
p['title']
1 Forest
2 [VIDEO_TITLE]
3 [VIDEO_TITLE]
4 [VIDEO_TITLE]
5 [${title}url=${videourl}]
p.dtypes
title object
dtype: object
and
c= ('${title}', '[VIDEO_TITLE]')
Since the rows 2,3,4,5 have the words in c, I want that to be replaced by NA values.
I'm trying the following,
p['title'].replace('|'.join(c),np.NAN,regex=True).fillna('NA')
This one runs without error, but I am getting the same input as output. There are no changes at all.
My next try is,
p['title'].apply(lambda x: 'NA' if any(s in x for s in c) else x)
which is throwing an error,
TypeError: argument of type 'float' is not iterable
I have tried several other things without much success. I am not sure what mistake I am making.
My ideal output would be,
p['title']
1 Forest
2 NA
3 NA
4 NA
5 NA
Can anybody help me in solving this?
You can use loc to set them as 'NA'. Since your values are sometimes inside a list, they first need to be extracted from the list. The second line extracts the first string from the list, if the value is a list. The third line checks for a match.
c = ('${title}', 'VIDEO_TITLE')
string_check = p['title'].map(lambda x: x if not isinstance(x, list) else x[0])
string_check = string_check.map(lambda s: isinstance(s, str) and any(c_str in s for c_str in c))  # non-strings such as NaN never match
p.loc[string_check, 'title'] = 'NA'
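To try this end to end, here is a minimal self-contained sketch of the same idea. The sample data is an assumption (one list-wrapped title and one NaN row, which would explain the "argument of type 'float' is not iterable" error), and `contains_placeholder` is a hypothetical helper name, not something from pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: plain strings, one list-wrapped string and one NaN.
p = pd.DataFrame({
    "title": [
        "Forest",
        ["[VIDEO_TITLE]"],
        "[VIDEO_TITLE]",
        np.nan,
        "[${title}url=${videourl}]",
    ]
})
c = ("${title}", "VIDEO_TITLE")


def contains_placeholder(value):
    """Return True if value (or its first list element) contains any string in c."""
    if isinstance(value, list):
        # Unwrap the first element; an empty list is treated as "no match".
        value = value[0] if value else None
    # Non-string cells such as NaN never match.
    return isinstance(value, str) and any(c_str in value for c_str in c)


# Boolean mask, one entry per row.
string_check = p["title"].map(contains_placeholder)

# Overwrite only the matching rows.
p.loc[string_check, "title"] = "NA"
print(p)
```

The NaN row is left untouched here because the helper only ever tests strings.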
Depending on what you're doing, you may want to consider setting the values to numpy.nan instead of the string 'NA'. This is the usual way pandas handles null values and there's a lot of functionality already built around this.
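As a sketch of the numpy.nan variant, assuming the titles are plain strings (not lists) with possibly some missing values: the placeholders contain regex metacharacters such as `$`, `{` and `[`, so they are escaped with `re.escape` before being joined into a pattern. The sample data below is made up to mirror the question.

```python
import re

import numpy as np
import pandas as pd

# Sample data modelled on the question; the NaN row is an assumption.
p = pd.DataFrame({
    "title": [
        "Forest",
        "[VIDEO_TITLE]",
        "[VIDEO_TITLE]",
        np.nan,
        "[${title}url=${videourl}]",
    ]
})
c = ("${title}", "[VIDEO_TITLE]")

# Escape each placeholder so that $, {, [ and ] are matched literally.
pattern = "|".join(re.escape(s) for s in c)

# na=False treats missing titles as "no match" instead of propagating NaN.
has_placeholder = p["title"].str.contains(pattern, regex=True, na=False)

# Assign through loc so the change is stored in the dataframe itself.
p.loc[has_placeholder, "title"] = np.nan
print(p)
```

Afterwards, `p['title'].isna()` flags the cleaned rows, and methods such as `dropna` or `fillna` can be applied to them directly.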