Search code examples
pythonpandasmasking

comparing separate dataframes in python removing values


I have two multi-index dataframes: data1 and data2 that I am running comparisons on.

Content of dataset data1 =

    Personalities   Rating
Type
        warm         5
Cat    caring        7
      frightful      9
       happy         3

Content of dataset data2 =

    Personalities   Rating
Type
          mean        3
Dog    ferocious      8
         loyal        4
         happy        1
          warm        6

I want to use a masking operation to identify all of the rows that do not have the same values in the personality column in both dataframes (all unique personality values).

Then, I need to remove all of those rows from the dataframe, so both dataframes will have the same values in their personalities columns.

My attempt consists in: new_data1['Personalities'].isin(new_data2['Personalities']).any(axis=1)

Result dataset of new_data1 =

    Personalities   Rating
Type
        warm         5
Cat    

       happy         3

Result dataset of new_data2

    Personalities   Rating
Type
          
Dog    

         happy        1
          warm        6

I'd like to create a new dataframe with the unique values where it would look something like this: in unique_data =

Personalities   Rating
Type

Cat    caring        7
      frightful      9
Dog    mean          3
      ferocious      8
      loyal          4
  

Solution

  • Input data:

    >>> df1
         Personalities  Rating
    Type
    Cat           warm       5
    Cat         caring       7
    Cat      frightful       9
    Cat          happy       3
    
    >>> df2
         Personalities  Rating
    Type
    Dog           mean       3
    Dog      ferocious       8
    Dog          loyal       4
    Dog          happy       1
    Dog           warm       6
    

    Prepare some sets:

    s1 = set(df1["Personalities"])
    s2 = set(df2["Personalities"])
    

    Now, you can extract the data you need:

    new_data1 = df1.loc[df1["Personalities"].isin(s1.intersection(s2))]
    
    new_data2 = df2.loc[df2["Personalities"].isin(s2.intersection(s1))]
    
    unique_data = pd.concat([df1.loc[df1["Personalities"].isin(s1.difference(s2))],
                             df2.loc[df2["Personalities"].isin(s2.difference(s1))]])
    
    >>> new_data1
         Personalities  Rating
    Type
    Cat           warm       5
    Cat          happy       3
    
    >>> new_data2
         Personalities  Rating
    Type
    Dog          happy       1
    Dog           warm       6
    
    >>> unique_data
         Personalities  Rating
    Type
    Cat         caring       7
    Cat      frightful       9
    Dog           mean       3
    Dog      ferocious       8
    Dog          loyal       4