Search code examples
pandaspandas-styles

Highlight the cells in different color if not exact dup


I have this dataframe. If the Description is the same then the job entry should be exactly the same.

mycol = ['Title', 'Location', 'Company', 'Salary', 'Sponsored', 'Description']
mylist=[('a', 'b', 'c', 'd', 'e', 'f'),
('a', 'b', 'c', 'd2', 'e', 'f'),
('g', 'h', 'i', 'j', 'k', 'l' ),
('g1', 'h', 'i', 'j', 'k', 'l' ),
('n', 'o', 'p', 'q', 'r', 's'),
('n1', 'o', 'p', 'q', 'r', 's')
]

df = pd.DataFrame(mylist, columns = mycol)

I will like to highlight the differences in yellow background as shown in this image... enter image description here

Is it possible in pandas?

Or I can export in excel and process it using VBA. I am trying to achieve this in pandas and then export to excel along with formatting.


Update:

Someone suggested to use this:

# Select all Columns but Description
cols = df.columns.symmetric_difference(['Description'])
# Clear All columns where Description is duplicated
df.loc[df['Description'].duplicated(), cols] = np.nan
# Fill foward over the blanks
df = df.ffill()

But it will replace the values and not highlight it.


Solution

  • We can clear the rows where the description is duplicated, then use groupby ffill to fill the values forward per Description:

    mask = df.copy(deep=True)
    # Select all Columns but Description
    cols = mask.columns.symmetric_difference(['Description'])
    # Clear All columns where Description is duplicated
    mask.loc[mask['Description'].duplicated(), cols] = np.nan
    # Fill foward over the blanks
    mask = mask.groupby(df['Description'].values).ffill()
    

    mask:

      Title Location Company Salary Sponsored Description
    0     a        b       c      d         e           f
    1     a        b       c      d         e           f
    2     g        h       i      j         k           l
    3     g        h       i      j         k           l
    4     n        o       p      q         r           s
    5     n        o       p      q         r           s
    

    This can then become our point to compare against:

    styles = (
        # Remove Where values are incorrect
        mask.where(mask.ne(df))
            # Back fill per group
            .groupby(df['Description'].values).bfill()
            # Anywhere values are not null
            .notnull()
            # Replace booleans with styling
            .replace({True: 'background-color: yellow;', False: ''})
    )
    
    df.style.apply(lambda _: styles, axis=None)
    

    styled table


    where and groupby bfill give us:

    mask.where(mask.ne(df)).groupby(df['Description'].values).bfill()
    
      Title Location Company Salary Sponsored Description
    0   NaN      NaN     NaN      d       NaN         NaN
    1   NaN      NaN     NaN      d       NaN         NaN
    2     g      NaN     NaN    NaN       NaN         NaN
    3     g      NaN     NaN    NaN       NaN         NaN
    4     n      NaN     NaN    NaN       NaN         NaN
    5     n      NaN     NaN    NaN       NaN         NaN
    

    Then notnull and replace allow styles to be set: styles:

                           Title Location Company                     Salary Sponsored Description
    0                                              background-color: yellow;                      
    1                                              background-color: yellow;                      
    2  background-color: yellow;                                                                  
    3  background-color: yellow;                                                                  
    4  background-color: yellow;                                                                  
    5  background-color: yellow;                                                                  
    

    Remember to write to_excel from the Styler object not the DataFrame:

    df.style.apply(lambda _: styles, axis=None).to_excel('out.xlsx')