Search code examples
pandasstringcontains

select rows based on multiple strings in one column


in Pandas I have an order database with fashion item sales and need to calculate the return rate of some specific items.I want to select all rows which contain 'christmas', 'xmas' or 'gift' in the item name but when I try more than one keyword I get an error. Is there a way to select multiple string names? Thanks very much! This is my (summarized) dataframe:

inp = [{'sales':200, 'returns':100, 'item_name':'cool gift red'}, {'sales':150, 'returns':100, 'item_name':'giftset deluxe'},{'sales':1000, 'returns':100, 'item_name':'xmas sweater'}, {'sales':100, 'returns':100, 'item_name':'ugly christmas jumper BILLY'},
       {'sales':100, 'returns':100, 'item_name':'blue sweater'},{'sales':100, 'returns':100, 'item_name':'ugly christmas jumper JOE'},
       {'sales':100, 'returns':100, 'item_name':'orange cardigan'}
    ] 
df = pd.DataFrame(inp) 
gift=df[df.item_name.str.contains('xmas')]
gift```  

Solution

  • The [] operator in a DataFrame is a "magic" and very powerful tool. But sometimes the program will run faster with a code just a bit more verbose.

    That said, [] and .loc[] can take multiple conditions:

    gift = df_q.loc[(condition1) | (condition2) | (condition3)]

    You can use |, &, and ~ for any combination of conditions.

    If the condition is an equality/comparison, make sure you ENCLOSE conditions in parenthesis as in

    .loc[(df.A == 1) & (df.B < 100)]

    The binary operators & and | take precedence over equalities/inequalities, so not using parenthesis would be equivalent to

    (df.A == (1 & df.B)) < 100