Search code examples
pythonpandaslist-comprehension

How to efficiently find a string that is contained in another string from a different group in pandas


given the following table sample:

identifier matched_string
occupation manager
occupation manager
skill manager
department marketing manager
skill marketing

I would like to find the cases where a matched_string is contained in another 'matched_string' and have a different identifier. In the end I would like to get a dataframe that lists these matches.

I have this code so far but I feel it is super inefficient:

#df5 is the dataframe I want to search through

df4 = pd.DataFrame()

#find overlaps from matched_strings -> where matched string is contained in another matched string and has a different identifier
for index, row in df5.iterrows():
    for index2, row2 in df5.iterrows():
        if row["matched_string"] in row2["matched_string"] and row["identifier"] != row2["identifier"]:
            df4 = df4.append(row1)
            df4 = df4.append(row2)      

Would someone have an idea on how to make this more efficient?


Solution

  • Wouldn't an aggregation as set per matched_string work for you?

    df.groupby('matched_string')['identifier'].agg(set)
    

    Output:

    matched_string
    manager              {occupation, skill}
    marketing                        {skill}
    marketing manager           {department}
    Name: identifier, dtype: object
    

    If you want to use it to filter your DataFrame and keep only the rows with multiple identifiers per matched_string:

    s = df.groupby('matched_string')['identifier'].agg(set)
    
    idx = s[s.str.len().gt(1)].index
    
    out = df.loc[df['matched_string'].isin(idx)]
    

    Output:

       identifier matched_string
    0  occupation        manager
    1  occupation        manager
    2       skill        manager