Extract all phrases from a pandas DataFrame based on multiple words in a list

I have a list, L:

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']

I have a pandas DataFrame, df:

Text
the objects are both before and after the person
the object is behind the person
the object in right is next to top left hand side of person

I would like to extract all words in L from the 'Text' column, producing this result:

Text	Extracted_Value
the objects are both before and after the person	before_after
the object is behind the person	behind
the object in right is next to top left hand side of person	right_top left hand side

My code works for cases 1 and 2:

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
pattern = r"(?:^|\s+)(" + "|".join(L) + r")(?:\s+|$)"
df["Extracted_Value"] = (
    df['Text'].str.findall(pattern).str.join("_").replace({"": None})
)

For case 3, I get right_top_hand instead of right_top left hand side.

As the third example shows, identified words that are contiguous should be picked up as a single phrase (one extraction). So in "the object in right is next to top left hand side of person" there are two extractions: "right" and "top left hand side". Only separate extractions are joined by an _; words inside a phrase stay separated by spaces.

I am not sure how to get it to work!

Solution
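
First, a note on why the original pattern returns right_top_hand. The trailing (?:\s+|$) consumes the space after each match, so the next word no longer has a leading space for (?:^|\s+) to match, and every second word in a run is skipped. The sketch below reproduces this with plain re on the third sentence; the lookaround variant is only an illustration of non-consuming boundaries (my variable names, not part of the question) and it still does not group contiguous words into phrases.

```python
import re

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
text = "the object in right is next to top left hand side of person"

# Pattern from the question: the whitespace after each word is consumed,
# so the word right after a match cannot match its own leading (?:^|\s+).
consuming = r"(?:^|\s+)(" + "|".join(L) + r")(?:\s+|$)"
skipped = re.findall(consuming, text)
print(skipped)  # ['right', 'top', 'hand']

# Illustration only: lookarounds assert the boundaries without consuming them,
# so every listed word is found (but still as separate words, not phrases).
lookaround = r"(?<!\S)(" + "|".join(L) + r")(?!\S)"
all_words = re.findall(lookaround, text)
print(all_words)  # ['right', 'top', 'left', 'hand', 'side']
```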

Try:

df["Extracted_Value"] = (
    df.Text.apply(
        lambda x: "|".join(w if w in L else "" for w in x.split()).strip("|")
    )
    .replace(r"\|{2,}", "_", regex=True)
    .str.replace("|", " ", regex=False)
)
print(df)

Prints:

                                                          Text           Extracted_Value
0             the objects are both before and after the person              before_after
1                              the object is behind the person                    behind
2  the object in right is next to top left hand side of person  right_top left hand side
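
To see how the chained calls arrive at that result, here is a minimal sketch that applies the same steps to the third sentence as a plain string, outside pandas. The step1 to step4 names are mine, and re.sub stands in for the regex-based Series.replace call.

```python
import re

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
text = "the object in right is next to top left hand side of person"

# Step 1: keep listed words, blank out the rest, and join everything with "|".
step1 = "|".join(w if w in L else "" for w in text.split())
print(step1)  # |||right||||top|left|hand|side||

# Step 2: drop the leading and trailing pipes left by unlisted words.
step2 = step1.strip("|")
print(step2)  # right||||top|left|hand|side

# Step 3: two or more pipes mean at least one unlisted word sat in between,
# so that gap becomes the "_" separating two extractions.
step3 = re.sub(r"\|{2,}", "_", step2)
print(step3)  # right_top|left|hand|side

# Step 4: a single remaining pipe joins adjacent listed words into one phrase.
step4 = step3.replace("|", " ")
print(step4)  # right_top left hand side
```

Because the words come from str.split(), a listed word with punctuation attached (for example "side,") would not be recognised by this approach.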

EDIT: Adapting @Wiktor's answer to pandas:

pattern = fr"\b((?:{'|'.join(L)})(?:\s+(?:{'|'.join(L)}))*)\b"

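# extractall returns one row per match; column 0 holds the captured phrase.
# Rows without any match are absent from the result and end up as NaN.
df["Extracted_Value"] = (
    df["Text"].str.extractall(pattern)[0].groupby(level=0).agg("_".join)
)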
print(df)
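
The same phrase pattern can be checked on plain strings before using it in pandas. This is a sketch under my own assumptions: the extract helper is hypothetical, the words are passed through re.escape in case the list ever contains regex metacharacters, and the outer capture group is dropped so that re.findall returns the whole phrase.

```python
import re

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']

# One listed word, optionally followed by more listed words separated
# only by whitespace; the whole run counts as a single phrase.
alt = "|".join(map(re.escape, L))
phrase = re.compile(rf"\b(?:{alt})(?:\s+(?:{alt}))*\b")


def extract(text):
    """Join every phrase found in text with '_' (hypothetical helper)."""
    return "_".join(phrase.findall(text))


print(extract("the objects are both before and after the person"))
# before_after
print(extract("the object is behind the person"))
# behind
print(extract("the object in right is next to top left hand side of person"))
# right_top left hand side
```

Note that this helper returns an empty string for text with no listed words, whereas the extractall version above leaves NaN for such rows.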