I have a Pandas dataframe and am working in a Jupyter notebook. I want to highlight rows in which column pairs are duplicated. Here is an example:
colA = list(range(1,6))
colB = ['aa', 'bb', 'aa', 'cc', 'aa']
colC = [14,3,14,9,12]
colD = [108, 2001, 152, 696, 696]
df = pd.DataFrame(list(zip(colA, colB, colC, colD)), columns =['colA', 'colB', 'colC', 'colD'])
display(df)
I want to highlight these rows, because the values in colB and colC are duplicates:
I am trying this lambda function, but it throws an error (and it's only for one column):
df.style.applymap(lambda x: 'background-color : yellow' if x[colB].duplicated(keep=False) else '')
TypeError: ("'int' object is not subscriptable", 'occurred at index colA')
Thanks for any help
Personally, I would break the problem into two steps rather than use one complicated lambda function. We can find the index of all the duplicate rows, then highlight the rows by index number. Also don't forget that in your lambda function, you should use a list comprehension in what you are returning.
rows_series = df[['colB','colC']].duplicated(keep=False)
rows = rows_series[rows_series].index.values
df.style.apply(lambda x: ['background: yellow' if x.name in rows else '' for i in x], axis=1)