Tags: python, pandas, regex, text-extraction

Extract all phrases from a pandas dataframe based on multiple words in list


I have a list, L:

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']

I have a pandas DataFrame, DF:

Text
the objects are both before and after the person
the object is behind the person
the object in right is next to top left hand side of person

I would like to extract every word in L that appears in the 'Text' column of DF, like this:

Text                                                          Extracted_Value
the objects are both before and after the person              before_after
the object is behind the person                               behind
the object in right is next to top left hand side of person   right_top left hand side

For cases 1 and 2, my code works:

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
pattern = r"(?:^|\s+)(" + "|".join(L) + r")(?:\s+|$)"
df["Extracted_Value"] = (
    df['Text'].str.findall(pattern).str.join("_").replace({"": None})
)

For case 3, however, I get right_top_hand instead of right_top left hand side.

As the third example shows, identified words that are contiguous should be picked up as a single phrase (one extraction). So in "the object in right is next to top left hand side of person" there are two extractions, "right" and "top left hand side", and only these two extractions are separated by an _.

I am not sure how to get it to work!


Solution

  • Try:

    df["Extracted_Value"] = (
        df.Text.apply(
            lambda x: "|".join(w if w in L else "" for w in x.split()).strip("|")
        )
        .replace(r"\|{2,}", "_", regex=True)
        .str.replace("|", " ", regex=False)
    )
    print(df)
    

    Prints:

                                                              Text           Extracted_Value
    0             the objects are both before and after the person              before_after
    1                              the object is behind the person                    behind
    2  the object in right is next to top left hand side of person  right_top left hand side
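
The same contiguous-run rule can perhaps be expressed more readably with `itertools.groupby`. The sketch below is not part of the original answer: the helper name `extract_phrases` and the `KEYWORDS` set are hypothetical, and it assumes words are separated by whitespace with no punctuation attached.

```python
from itertools import groupby

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
KEYWORDS = set(L)  # hypothetical name; a set gives fast membership tests


def extract_phrases(text):
    """Join each run of adjacent keywords with spaces, and the runs with '_'."""
    runs = (
        " ".join(group)
        # groupby splits the word list into consecutive runs that share the
        # same key, here "is this word a keyword?" (True or False).
        for is_keyword, group in groupby(text.split(), key=KEYWORDS.__contains__)
        if is_keyword
    )
    return "_".join(runs)


print(extract_phrases("the objects are both before and after the person"))
# before_after
print(extract_phrases("the object in right is next to top left hand side of person"))
# right_top left hand side
```

With the DataFrame from the question, this could be applied as `df["Text"].apply(extract_phrases)`. A text with no keywords yields an empty string.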
    

    EDIT: Adapting @Wiktor's answer to pandas:

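    # One capture group: a listed word followed by any number of
    # whitespace-separated listed words, i.e. one contiguous phrase.
    pattern = fr"\b((?:{'|'.join(L)})(?:\s+(?:{'|'.join(L)}))*)\b"

    df["Extracted_Value"] = (
        # Column 0 holds the capture group; join each row's phrases with "_".
        df["Text"].str.extractall(pattern)[0].groupby(level=0).agg("_".join)
    )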
    print(df)
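
To see why the pattern in the question returns right_top_hand, it may help to run both patterns with plain `re` outside pandas. This is only an illustrative sketch: the variable names are made up here, and `re.escape` is an added precaution that the original code does not use.

```python
import re

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
text = "the object in right is next to top left hand side of person"

# Escape each word in case the list ever contains regex metacharacters.
alternation = "|".join(map(re.escape, L))

# Pattern from the question: the trailing \s+ consumes the space after a
# match, so the next word has no leading whitespace left and is skipped.
consuming = r"(?:^|\s+)(" + alternation + r")(?:\s+|$)"
print(re.findall(consuming, text))
# ['right', 'top', 'hand']

# Phrase pattern from the answer: one keyword followed by any number of
# whitespace-separated keywords, delimited by word boundaries.
phrase = rf"\b((?:{alternation})(?:\s+(?:{alternation}))*)\b"
print("_".join(re.findall(phrase, text)))
# right_top left hand side
```

Word boundaries (`\b`) match a position without consuming any characters, which is what lets adjacent keywords join into a single match.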