Search code examples
pythonexcelautomation

How do you compare multiple columns in Excel using Python?


I've got an excel sheet that has values such as:

ProductID Reference Component used DESCRIPTION
5325646 PD-42#23 ASDGE34#W-AL HOUSING PLATE
4563756 ITA-COPPER RAW AL WASHER
4345432 ASDGE34#W-AL ZK23456 3MM X 2MM X 10MM COPPER BAR
1245456 ASDGE34#W-AL ZK66634 1MM X 2MM X 9MM COPPER BAR

What I essentially want to do is match anything in "Component" with "Reference" and if they come true, like in this example, ASDGE34#W-AL matches in "Component" and in "Reference", so I'd want to copy the details over and shift it after "Description". The end result that I want to achieve is this.

ProductID Reference Component used DESCRIPTION Child Component ID DESCRIPTION
5325646 PD-42#23 ASDGE34#W-AL HOUSING PLATE ZK23456 & ZK654534 3MM X 2MM X 10MM COPPER BAR &1MM X 2MM X 9MM COPPER BAR
4563756 ITA-COPPER ZAAGD0545 ALUMINIUM BAR
5656565 0788866-7#D YN435611 DUSTER

Can anyone help with this? I've been using Python to try and iterate a loop over the "Components used" and the "Reference" column seeing what match but no luck so far.


Solution

  • I have code that does something similar. I've changed it to suit your problem.

    The products are stored in 2 separate dict lists, accessed by Reference and Component used. You can easily iterate over the products_by_reference dict using a Component used value to get a list of products with a matching reference.

    Note that this implementation assumes each column header is unique, as each row is stored as a dict where each column header is used as the key to access the corresponding value, so if a header is used twice, the first value will be overwritten. Because of this, I manually rename the header Child DESCRIPTION to DESCRIPTION after writing the header columns.

    import os
    from collections import defaultdict
    import openpyxl
    
    # Load the existing spreadsheet.
    wb = openpyxl.load_workbook(filename="Products.xlsx")
    ws = wb.worksheets[0]
    
    # Style used for the header row
    bold_font = openpyxl.styles.Font(bold=True)
    grey_fill = openpyxl.styles.fills.PatternFill("solid", fgColor="F8F9F9")
    
    # Store the header column names and numbers.
    # Key = column name, value = column number. e.g. "ProductID": 1
    header_columns = {}
    for column_num in range(1, ws.max_column + 1):
        column_name = ws.cell(row=1, column=column_num).value
        header_columns[column_name] = column_num
    
    
    # Read in all the values, starting after the header row.
    # Store each product as a dict, and store every product in 2 separate 
    # dict of lists, with the keys being the reference and component used.
    products_by_reference = defaultdict(list)
    products_by_component = defaultdict(list)
    for row_num in range(2, ws.max_row + 1):
        product = {}
        for column_name, column_num in header_columns.items():
            product[column_name] = ws.cell(row=row_num, column=column_num).value
        
        reference = product["Reference"]
        component = product["Component used"]
        products_by_reference[reference].append(product)
        products_by_component[component].append(product)
    
    # Iterate over the products by component, and copy and join all the 
    # components used and descriptions to the relevant product's dict.
    for component, products in products_by_component.items():
        # Use dicts here instead of sets if preserving the order is important.
        child_components = set()
        child_descriptions = set()
        for product in products_by_reference[component]:
            child_components.add(product["Component used"])
            child_descriptions.add(product["DESCRIPTION"])
        
        child_components_string = " & ".join(child_components)
        child_descriptions_string = " & ".join(child_descriptions)
        for product in products:
            product["Child Component ID"] = child_components_string
            product["Child DESCRIPTION"] = child_descriptions_string
        
        # products_by_reference will be used later for the new spreadsheet, and 
        # we don't want to include any products that have a reference which is a 
        # component used by another product, since they have now been merged with 
        # that other product.
        del products_by_reference[component]
    
    # Write the products to a new spreadsheet.
    wb = openpyxl.Workbook()
    ws = wb.active
    ws.title = "Products"
    
    # Freeze the header row.
    ws.freeze_panes = "A2"
    
    # Add the extra headers.
    column_num = len(header_columns)
    header_columns["Child Component ID"] = column_num + 1
    header_columns["Child DESCRIPTION"] = column_num + 2
    
    # Write the headers.
    for column_num, column_name in enumerate(header_columns, start=1):
        cell = ws.cell(row=1, column=column_num, value=column_name)
        cell.font = bold_font
        cell.fill = grey_fill
    
    # This implementation doesn't allow there to be duplicate headers, as the 
    # headers are used as keys to retrieve values from the product dicts. So the 
    # the last header is manually changed from "Child DESCRIPTION" to 
    # "DESCRIPTION" after it has been written to the sheet.
    cell.value = "DESCRIPTION"
    
    # Flatten the dictionary of lists, and write every product to the spreadsheet.
    products = [product for products_list in products_by_reference.values() for product in products_list]
    for row_num, product in enumerate(products, start=2):
        for column_name, column_num in header_columns.items():
            cell = ws.cell(row=row_num, column=column_num, value=product.get(column_name))
    
    # Save the spreadsheet to file, then launch it.
    filename = "Merged products.xlsx"
    wb.save(filename)
    os.startfile(filename)