Search code examples
pandasstringdataframehierarchystring-matching

Identifying Correct String Order in Pandas


I have a dataframe as the following, showing the relationship of different entities in each row.

Child Parent Ult_Parent Full_Family
A032 A001 A039 A001, A032, A039, A040, A041, A043, A043, A045, A046

In the "Full_Family" column, it's showing the correct hierarchy for the whole family tree from small to big, not just entities in that specific row. But to be aware of, not every entity in the "Full_Family" will appear in the "Child/Parent/Ult_Parent", as the "Full_Family" data is coming from the other source.

Here I have 2 problems need to solve:

  1. The Child/Parent/Ult_Parent order is not correct, how to based on the "Full_Family" column to identify the correct order?
  2. If possible, Can I also identify the correct Ult_Parent for a family in the whole table instead of each row. This correct Ult_Parent needs to be the entity that has appeared in Child/Parent/Ult_Parent columns.

Below is the example and ideal outcome:

Child Parent Ult_Parent Full_Family Correct_Order Correct_Ult_Parent_per_Family
A032 A001 A039 A001, A032, A039, A040, A041, A043, A043, A045, A046 A001, A032, A039 A043
A001 A043 A039 A001, A032, A039, A040, A041, A043, A043, A045, A046 A001, A039, A043 A043

Although A046 is the ultimate parent in this family, but it does not appear in this dataframe's Child/Parent/Ult_Parent, so the correct ultimate parent per family is A043 in this case.

Appreciate the help.


Solution

  • IIUC, assuming this dataframe as input:

    import pandas as pd
    
    data = {
        "Child": ["A032", "A001"],
        "Parent": ["A001", "A043"],
        "Ult_Parent": ["A039", "A039"],
        "Full_Family": [
            "A001, A032, A039, A040, A041, A043, A043, A045, A046",
            "A001, A032, A039, A040, A041, A043, A043, A045, A046",
        ],
    }
    
    df = pd.DataFrame(data)
    
      Child Parent Ult_Parent                                        Full_Family
    0  A032   A001       A039  A001, A032, A039, A040, A041, A043, A043, A045...
    1  A001   A043       A039  A001, A032, A039, A040, A041, A043, A043, A045...
    

    You can use this approach:

    df["Correct_Order"] = df.apply(
        lambda row: ", ".join(sorted([row["Parent"], row["Child"], row["Ult_Parent"]])),
        axis=1,
    )
    
    df["Correct_Ult_Parent_per_Family"] = (
        df[["Parent", "Child", "Ult_Parent"]].max(axis=1).max()
    )
    
      Child Parent Ult_Parent                                        Full_Family     Correct_Order Correct_Ult_Parent_per_Family
    0  A032   A001       A039  A001, A032, A039, A040, A041, A043, A043, A045...  A001, A032, A039                          A043
    1  A001   A043       A039  A001, A032, A039, A040, A041, A043, A043, A045...  A001, A039, A043                          A043
    

    If 'Full_Family' is not necessarily in ascending order, and you want to respect its order, you can define a custom key to sorted.

    For example, if A039 comes before A032 in 'Full_Family' in the first row:

    data = {
        "Child": ["A032", "A001"],
        "Parent": ["A001", "A043"],
        "Ult_Parent": ["A039", "A039"],
        "Full_Family": [
            "A001, A039, A032, A040, A041, A043, A043, A045, A046",
            "A001, A032, A039, A040, A041, A043, A043, A045, A046",
        ],
    }
    
    df = pd.DataFrame(data)
    

    Using a custom key:

    df["Correct_Order"] = df.apply(
        lambda row: ", ".join(
            sorted(
                [row["Parent"], row["Child"], row["Ult_Parent"]],
                key=lambda x: {
                    val: idx for idx, val in enumerate(row["Full_Family"].split(", "))
                }[x],
            )
        ),
        axis=1,
    )
    
    df["Correct_Ult_Parent_per_Family"] = df["Correct_Order"].str.split().str[-1].max()
    
      Child Parent Ult_Parent                                        Full_Family     Correct_Order Correct_Ult_Parent_per_Family
    0  A032   A001       A039  A001, A039, A032, A040, A041, A043, A043, A045...  A001, A039, A032                          A043
    1  A001   A043       A039  A001, A032, A039, A040, A041, A043, A043, A045...  A001, A039, A043                          A043