I have this dataframe. If the Description is the same then the job entry should be exactly the same.
mycol = ['Title', 'Location', 'Company', 'Salary', 'Sponsored', 'Description']
mylist=[('a', 'b', 'c', 'd', 'e', 'f'),
('a', 'b', 'c', 'd2', 'e', 'f'),
('g', 'h', 'i', 'j', 'k', 'l' ),
('g1', 'h', 'i', 'j', 'k', 'l' ),
('n', 'o', 'p', 'q', 'r', 's'),
('n1', 'o', 'p', 'q', 'r', 's')
]
df = pd.DataFrame(mylist, columns = mycol)
I will like to highlight the differences in yellow background as shown in this image...
Is it possible in pandas?
Or I can export in excel and process it using VBA. I am trying to achieve this in pandas and then export to excel along with formatting.
Update:
Someone suggested to use this:
# Select all Columns but Description
cols = df.columns.symmetric_difference(['Description'])
# Clear All columns where Description is duplicated
df.loc[df['Description'].duplicated(), cols] = np.nan
# Fill foward over the blanks
df = df.ffill()
But it will replace the values and not highlight it.
We can clear the rows where the description is duplicated
, then use groupby ffill
to fill the values forward per Description:
mask = df.copy(deep=True)
# Select all Columns but Description
cols = mask.columns.symmetric_difference(['Description'])
# Clear All columns where Description is duplicated
mask.loc[mask['Description'].duplicated(), cols] = np.nan
# Fill foward over the blanks
mask = mask.groupby(df['Description'].values).ffill()
mask
:
Title Location Company Salary Sponsored Description
0 a b c d e f
1 a b c d e f
2 g h i j k l
3 g h i j k l
4 n o p q r s
5 n o p q r s
This can then become our point to compare against:
styles = (
# Remove Where values are incorrect
mask.where(mask.ne(df))
# Back fill per group
.groupby(df['Description'].values).bfill()
# Anywhere values are not null
.notnull()
# Replace booleans with styling
.replace({True: 'background-color: yellow;', False: ''})
)
df.style.apply(lambda _: styles, axis=None)
where
and groupby bfill
give us:
mask.where(mask.ne(df)).groupby(df['Description'].values).bfill()
Title Location Company Salary Sponsored Description
0 NaN NaN NaN d NaN NaN
1 NaN NaN NaN d NaN NaN
2 g NaN NaN NaN NaN NaN
3 g NaN NaN NaN NaN NaN
4 n NaN NaN NaN NaN NaN
5 n NaN NaN NaN NaN NaN
Then notnull
and replace
allow styles to be set:
styles
:
Title Location Company Salary Sponsored Description
0 background-color: yellow;
1 background-color: yellow;
2 background-color: yellow;
3 background-color: yellow;
4 background-color: yellow;
5 background-color: yellow;
Remember to write to_excel
from the Styler object not the DataFrame:
df.style.apply(lambda _: styles, axis=None).to_excel('out.xlsx')