Tags: python, python-2.7, numpy, pandas, data-cleaning

Replace row value with NaN if particular word is present - Python


I am in the process of cleaning a dataframe, and I want to check whether any of its values contain a word from a given list. If a value does, it should be replaced by NA. For example:

My dataframe looks like this:

p['title']

1                                             Forest
2                                            [VIDEO_TITLE]
3                                            [VIDEO_TITLE]
4                                            [VIDEO_TITLE]
5                                [${title}url=${videourl}]


p.dtypes
title    object
dtype: object

and

c= ('${title}', '[VIDEO_TITLE]')

Since rows 2, 3, 4 and 5 contain the words in c, I want those values to be replaced by NA.

I'm trying the following,

p['title'].replace('|'.join(c),np.NAN,regex=True).fillna('NA')

This runs without error, but the output is identical to the input. Nothing changes at all.

My next try is,

p['title'].apply(lambda x: 'NA' if any(s in x for s in c) else x)

which throws an error:

TypeError: argument of type 'float' is not iterable

I have tried several other things without much success. I am not sure what mistake I am making.

My ideal output would be,

p['title']

1     Forest
2        NA
3        NA
4        NA
5        NA

Can anybody help me in solving this?


Solution

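  • You can use loc to set them to 'NA'. Since your values are sometimes inside a list, they first need to be extracted from it. The second line takes the first element when the cell holds a list and leaves any other value unchanged. The third line checks each value for a match against the words in c. Note that c here uses 'VIDEO_TITLE' without the surrounding brackets.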

    c = ('${title}', 'VIDEO_TITLE')
    string_check = p['title'].map(lambda x: x if not isinstance(x, list) else x[0])
    string_check = string_check.map(lambda s: not isinstance(s, float) and any(c_str in s for c_str in c))  # floats such as NaN never match
    p.loc[string_check, 'title'] = 'NA'
    

    Depending on what you're doing, you may want to consider setting the values to numpy.nan instead of the string 'NA'. This is the usual way pandas handles null values and there's a lot of functionality already built around this.
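
As a rough illustration of the numpy.nan variant mentioned above, here is a minimal sketch. The sample frame is hypothetical: it assumes the column mixes plain strings, single-element lists and an existing missing value, which may not match your real data. The helper name has_marker is made up for this example, and the isinstance(value, str) check assumes Python 3 (on Python 2.7 you would likely test against basestring instead).

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: plain strings, a single-element list, and a missing value.
p = pd.DataFrame({'title': ['Forest', ['VIDEO_TITLE'], '[VIDEO_TITLE]',
                            '[${title}url=${videourl}]', np.nan]})
c = ('${title}', 'VIDEO_TITLE')


def has_marker(value):
    """Return True if value (or the first item of a list) contains a word from c."""
    if isinstance(value, list):
        value = value[0] if value else None
    # Only strings can contain a marker; NaN, None and other types never match.
    return isinstance(value, str) and any(marker in value for marker in c)


# Boolean mask of the rows to blank out, then assign real missing values.
mask = p['title'].map(has_marker)
p.loc[mask, 'title'] = np.nan

print(p['title'].isnull().tolist())
```

With real missing values in place, methods such as isnull(), dropna() and fillna('NA') work directly on the column, so the 'NA' label can still be applied later if it is only needed for display.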