How to efficiently find a string that is contained in another string from a different group in pandas

given the following table sample:

identifier	matched_string
occupation	manager
occupation	manager
skill	manager
department	marketing manager
skill	marketing

I would like to find the cases where a matched_string is contained in another 'matched_string' and have a different identifier. In the end I would like to get a dataframe that lists these matches.

I have this code so far but I feel it is super inefficient:

#df5 is the dataframe I want to search through

df4 = pd.DataFrame()

#find overlaps from matched_strings -> where matched string is contained in another matched string and has a different identifier
for index, row in df5.iterrows():
    for index2, row2 in df5.iterrows():
        if row["matched_string"] in row2["matched_string"] and row["identifier"] != row2["identifier"]:
            df4 = df4.append(row1)
            df4 = df4.append(row2)

Would someone have an idea on how to make this more efficient?

Solution

Wouldn't an aggregation as set per matched_string work for you?

df.groupby('matched_string')['identifier'].agg(set)

Output:

matched_string
manager              {occupation, skill}
marketing                        {skill}
marketing manager           {department}
Name: identifier, dtype: object

If you want to use it to filter your DataFrame and keep only the rows with multiple identifiers per matched_string:

s = df.groupby('matched_string')['identifier'].agg(set)

idx = s[s.str.len().gt(1)].index

out = df.loc[df['matched_string'].isin(idx)]

Output:

   identifier matched_string
0  occupation        manager
1  occupation        manager
2       skill        manager