I have two multi-index dataframes: data1
and data2
that I am running comparisons on.
Content of dataset data1 =
Personalities Rating
Type
warm 5
Cat caring 7
frightful 9
happy 3
Content of dataset data2 =
Personalities Rating
Type
mean 3
Dog ferocious 8
loyal 4
happy 1
warm 6
I want to use a masking operation to identify all of the rows that do not have the same values in the personality column in both dataframes (all unique personality values).
Then, I need to remove all of those rows from the dataframe, so both dataframes will have the same values in their personalities columns.
My attempt consists in: new_data1['Personalities'].isin(new_data2['Personalities']).any(axis=1)
Result dataset of new_data1 =
Personalities Rating
Type
warm 5
Cat
happy 3
Result dataset of new_data2
Personalities Rating
Type
Dog
happy 1
warm 6
I'd like to create a new dataframe with the unique values where it would look something like this:
in unique_data =
Personalities Rating
Type
Cat caring 7
frightful 9
Dog mean 3
ferocious 8
loyal 4
Input data:
>>> df1
Personalities Rating
Type
Cat warm 5
Cat caring 7
Cat frightful 9
Cat happy 3
>>> df2
Personalities Rating
Type
Dog mean 3
Dog ferocious 8
Dog loyal 4
Dog happy 1
Dog warm 6
Prepare some sets:
s1 = set(df1["Personalities"])
s2 = set(df2["Personalities"])
Now, you can extract the data you need:
new_data1 = df1.loc[df1["Personalities"].isin(s1.intersection(s2))]
new_data2 = df2.loc[df2["Personalities"].isin(s2.intersection(s1))]
unique_data = pd.concat([df1.loc[df1["Personalities"].isin(s1.difference(s2))],
df2.loc[df2["Personalities"].isin(s2.difference(s1))]])
>>> new_data1
Personalities Rating
Type
Cat warm 5
Cat happy 3
>>> new_data2
Personalities Rating
Type
Dog happy 1
Dog warm 6
>>> unique_data
Personalities Rating
Type
Cat caring 7
Cat frightful 9
Dog mean 3
Dog ferocious 8
Dog loyal 4