Search code examples
pythonpandasdataframexlsx

How to remove duplicates in .xlsx and move values to new column with pandas


I have .xlsx sheet with multiple entries: Entries

what I try to achieve: result

Therefore I am really stuck. I have tried df.drop_duplicates() but it does not help. If you could share some ideas or where to look it would be awesome! TIA

P.S. sorry for the pictures, but it does not render correctly for some reason.


Solution

  • My strategy was to create and then join 2 dataframes with the index set to the Trigger column. One dataframe contains only the Trigger, To, and From columns, while the other dataframe has the Trigger, Category, and Description columns. The "trick" here is to use DataFrame.pivot to get the corresponding Category and Description values into their own columns.

    import pandas as pd
    import os
    
    filename = r'logfile.xlsx'
    new_filename = r'consolidated_logfile.xlsx'
    
    df = pd.read_excel(filename)
    
    # Create a dataframe with the actual records we want to be left with at the end
    # You might think of this as removing duplicates, in a way
    records = df.loc[df['String Value'].isna()]
    
    # Remove the 'String Value' and 'Remarks' columns
    records = records.drop(columns=['String Value', 'Remarks'])
    
    # Set the index to the 'Trigger' column, so another dataframe with the extra 
    # details ('Category' and 'Description' columns) can be joined/merged later on
    records = records.set_index('Trigger')
    
    # Create a dataframe with just the extra details ('Category' and 'Description' columns)
    details = df.loc[df['String Value'].notna()]
    
    # Transform the dataframe using DataFrame.pivot so that:
    # - the indexes are the values from the 'Trigger' column
    # - the columns are the values from the 'String Value' column
    # - the values are taken from the 'Remarks' column
    details = details.pivot(index='Trigger', columns='String Value', values='Remarks')
    
    # Combine the 2 dataframes with DataFrame.join
    combined = records.join(details)
    
    # Save to a new spreadsheet
    combined.to_excel(new_filename)
    
    # Launch the new spreadsheet
    os.startfile(new_filename)
    

    To make it easier to understand, here are the dataframes after each step:

    df = pd.read_excel(filename)
    
                       Trigger   From     To String Value                    Remarks
    0  incident: INC111111111    Tom  Peter          NaN                        NaN
    1      request: REQ123464   John   Alex          NaN                        NaN
    2   incident: INC22222222  David  Maria          NaN                        NaN
    3  incident: INC111111111    NaN    NaN     Category                       FYI 
    4      request: REQ123464    NaN    NaN     Category               Wrong Route 
    5   incident: INC22222222    NaN    NaN     Category                       FYI 
    6  incident: INC111111111    NaN    NaN  Description            Wrongly routed 
    7      request: REQ123464    NaN    NaN  Description  Dispatch to another team 
    8   incident: INC22222222    NaN    NaN  Description     Try more troubleshoot 
    
    # Create a dataframe with the actual records we want to be left with at the end
    # You might think of this as removing duplicates, in a way
    records = df.loc[df['String Value'].isna()]
    
                       Trigger   From     To String Value Remarks
    0  incident: INC111111111    Tom  Peter          NaN     NaN
    1      request: REQ123464   John   Alex          NaN     NaN
    2   incident: INC22222222  David  Maria          NaN     NaN
    
    # Remove the 'String Value' and 'Remarks' columns
    records = records.drop(columns=['String Value', 'Remarks'])
    
                       Trigger   From     To
    0  incident: INC111111111    Tom  Peter
    1      request: REQ123464   John   Alex
    2   incident: INC22222222  David  Maria
    
    # Set the index to the 'Trigger' column, so another dataframe with the extra 
    # details ('Category' and 'Description' columns) can be joined/merged later on
    records = records.set_index('Trigger')
    
                              From     To
    Trigger                             
    incident: INC111111111    Tom  Peter
    request: REQ123464       John   Alex
    incident: INC22222222   David  Maria
    
    # Create a dataframe with just the extra details ('Category' and 'Description' columns)
    details = df.loc[df['String Value'].notna()]
    
                       Trigger From   To String Value                    Remarks
    3  incident: INC111111111  NaN  NaN     Category                       FYI 
    4      request: REQ123464  NaN  NaN     Category               Wrong Route 
    5   incident: INC22222222  NaN  NaN     Category                       FYI 
    6  incident: INC111111111  NaN  NaN  Description            Wrongly routed 
    7      request: REQ123464  NaN  NaN  Description  Dispatch to another team 
    8   incident: INC22222222  NaN  NaN  Description     Try more troubleshoot 
    
    # Transform the dataframe using DataFrame.pivot so that:
    # - the indexes are the values from the 'Trigger' column
    # - the columns are the values from the 'String Value' column
    # - the values are taken from the 'Remarks' column
    details = details.pivot(index='Trigger', columns='String Value', values='Remarks')
    
     String Value                Category                Description
    Trigger                                                        
    incident: INC111111111          FYI             Wrongly routed 
    incident: INC22222222           FYI      Try more troubleshoot 
    request: REQ123464      Wrong Route   Dispatch to another team 
    
    # Combine the 2 dataframes with DataFrame.join
    combined = records.join(details)
    
                              From     To      Category                Description
    Trigger                                                                      
    incident: INC111111111    Tom  Peter          FYI             Wrongly routed 
    request: REQ123464       John   Alex  Wrong Route   Dispatch to another team 
    incident: INC22222222   David  Maria          FYI      Try more troubleshoot