I have a list, L:
L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
I have a pandas DataFrame, DF:
Text |
---|
the objects are both before and after the person |
the object is behind the person |
the object in right is next to top left hand side of person |
I would like to extract every word in L that appears in the DF column 'Text', so that the result looks like this:
Text | Extracted_Value |
---|---|
the objects are both before and after the person | before_after |
the object is behind the person | behind |
the object in right is next to top left hand side of person | right_top left hand side |
For cases 1 and 2, my code works:
L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
pattern = r"(?:^|\s+)(" + "|".join(L) + r")(?:\s+|$)"
df["Extracted_Value"] = (
df['Text'].str.findall(pattern).str.join("_").replace({"": None})
)
For case 3, I get right_top_hand.
As the third example shows, if the identified words are contiguous, they should be picked up as a single phrase (one extraction). So in "the object in right is next to top left hand side of person" there are two extractions: "right" and "top left hand side". Only these two extractions are separated by an _.
I am not sure how to get it to work!
Try:
df["Extracted_Value"] = (
df.Text.apply(
lambda x: "|".join(w if w in L else "" for w in x.split()).strip("|")
)
.replace(r"\|{2,}", "_", regex=True)
.str.replace("|", " ", regex=False)
)
print(df)
Prints:
Text Extracted_Value
0 the objects are both before and after the person before_after
1 the object is behind the person behind
2 the object in right is next to top left hand side of person right_top left hand side
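To see why this works, it may help to run the same three steps on a single string without pandas. This is only a sketch of the idea: the names `marked`, `grouped` and `result` are mine, and `re.sub` stands in for the regex `.replace` call above.

```python
import re

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
text = "the object in right is next to top left hand side of person"

# Step 1: keep words that are in L, blank out the rest, join on "|".
# Every dropped word leaves an empty slot, i.e. an extra pipe.
marked = "|".join(w if w in L else "" for w in text.split()).strip("|")

# Step 2: a run of two or more pipes means at least one dropped word
# sat between two matches, so that run becomes the "_" separator.
grouped = re.sub(r"\|{2,}", "_", marked)

# Step 3: any single pipe left joins two neighbouring matches,
# so it turns back into a space inside the phrase.
result = grouped.replace("|", " ")

print(marked)
print(grouped)
print(result)
```

Note that this approach splits on whitespace only, so a word followed directly by punctuation (for example "side,") would not be found in L.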
EDIT: Adapting @Wiktor's answer to pandas:
pattern = fr"\b((?:{'|'.join(L)})(?:\s+(?:{'|'.join(L)}))*)\b"
df["Extracted_Value"] = (
# extractall returns one row per match; column 0 holds the capture group
df["Text"].str.extractall(pattern)[0].groupby(level=0).agg("_".join)
)
print(df)
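As for why the pattern in the question returns right_top_hand: its `(?:^|\s+)` and `(?:\s+|$)` parts consume the whitespace around each hit, so the word right after a match has no leading whitespace left to match against and gets skipped. The `\b` version consumes nothing around the words and matches a whole run at once. Here is a small comparison with plain `re`; the variable names are mine, and `re.escape` is an extra precaution that is not in the original code.

```python
import re

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
text = "the object in right is next to top left hand side of person"

# Escape each word in case the list ever contains regex metacharacters.
alternation = "|".join(map(re.escape, L))

# Pattern from the question: each match eats the whitespace on both sides,
# so "left" (right after "top ") and "side" (right after "hand ") are missed.
consuming = r"(?:^|\s+)(" + alternation + r")(?:\s+|$)"
naive_result = "_".join(re.findall(consuming, text))

# Run-based pattern: one listed word, then any number of
# whitespace-separated listed words, bounded by word boundaries.
runs = rf"\b((?:{alternation})(?:\s+(?:{alternation}))*)\b"
runs_result = "_".join(re.findall(runs, text))

print(naive_result)
print(runs_result)
```

The first line printed should be the result from the question (right_top_hand), and the second the desired right_top left hand side.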