Search code examples
pandaspython-2.7dictionary

If dataframe column value matches dictionary key, check if different column matches dictionary value


I have a dataframe with 2 columns of interest. Both are full of strings. I also have a dictionary of mapped key value pairs which are also strings. I'm using the dictionary's keys to filter the dataframe by the first column for only those keys that are in dictionary.

The ultimate goal is to then lookup the first column of the dataframe match it to the key in the dictionary and then confirm column 2's value matches the value to the dictionary.

The filtered dataframe on the keys of interest is working as expected, so I'm left with a dataframe of two columns that have only column keys that are present in the dictionary. The filtered down dataframe can be anywhere from a few rows to thousands of rows but the dictionary is static in length.

A final output should be a dataframe whose contents show rows of the filtered dataframe where the values of the second column don't match the values of the dictionary.

pairs = {'red': 'apple', 'blue': 'blueberry', 'yellow':'banana'}
filtered_data = {'Color':['red', 'blue'], 'Fruit':['appl','blueberry']}
filtered_df = pd.DataFrame(filtered_data)

#so the filtered_df would resemble
Color     Fruit
red       appl
blue      blueberry

for row in filtered_df.iterrows():
   for k,v in pairs.items():
       #Here's where I'd like to check the value of column 1, find it in the dict then if the 
       #values dont match between col 2 in the df and the dict, append the mismatched row to a 
       #new df.
       if row['Color'] == k:
          new_df.append(row).where(row['Fruit'] != v)

I'm sure I need an index with the row in the first for loop but I'm not sure how to format the rest of the nested loop structure.

Ideally, when I export my new_df dataframe in this scenario it would have 1 row with the Color column of red and the Fruit column of appl since it doesn't match the dictionary similar to the below.

Color   Fruit
red     appl

Solution

  • color_fruit = pd.Series((tuple(x) for x in filtered_df.values), index=filtered_df.index)
    result = filtered_df[~color_fruit.isin(pairs.items())]
    

    The explanation:

    In the first row we create a series of tuples (pairs) from columns of original dataframe, with the same index.

    Then we use it to filter original dataframe, selecting only rows which do not (~) satisfy the condition being a member (.isin()) of pairs (key, value) from the pairs dictionary.