Search code examples
python-3.xpandasdataframedata-analysisdata-cleaning

How to remove a complete row when no match found in a column's string values with any object from a given list?


Please help me complete this piece of code. Let me know of any other detail is required.

Thanks in advance!

Given: a column 'PROD_NAME' from pandas dataframe of string type (e.g. Smiths Crinkle Cut Chips Chicken g), a list of certain words (['Chip', 'Chips' etc])

To do: if none of the words from the list is contained in the strings of the dataframe objects, we drop the whole row. Basically we're removing unnecessary products from a dataframe.

This is what data looks like:

enter image description here

Here's my code:

# create a function to Keep only those products which have 
# chip, chips, doritos, dorito, pringle, Pringles, Chps, chp, in their name
def onlyChips(df, *cols):
    temp = []
    chips = ['Chip', 'Chips', 'Doritos', 'Dorito', 'Pringle', 'Pringles', 'Chps', 'Chp']
    copy = cp.deepcopy(df)
    for col in [*cols]:
        for i in range(len(copy[col])):
            for item in chips:
                if item not in copy[col][i]:
                    flag = False
                else: 
                    flag = True
                    break;
                # drop only those string which doesn't have any match from chips list, if flag never became True
                if not flag:
                    # drop the whole row
    return <new created dataframe>

new = onlyChips(df_txn, 'PROD_NAME')

Solution

  • Filter the rows instead of deleting them. Create a boolean mask for each row. Use str.contains on each column you need to search and see if any of the columns match the given criteria row-wise. Filter the rows if not.

    search_cols = ['PROD_NAME']
    mask = df[search_cols].apply(lambda x: x.str.contains('|'.join(chips))).any(axis=1)
    df = df[mask]