Search code examples
pythonpandastextsplitnlp

split dataframe text column with given strings - NLP


I have a dataframe with a text column in this form:

column_description

"this section includes: animals: cats and dogs and vegetables but doesn’t include: plants and fruits: coco"
"this section includes the following: axis: x and y but doesn’t include: z, k and c"
"this section includes notably: letters: a, b and c however it doesn’t include: y and letter: z"

I want to separate the text within the column and get two new columns like the following:

column_include

"animals: cats and dogs and vegetables"
"axis: x and y"
"letters: a, b and c "
column_exclude

"plants and fruits: coco"
"z, k and c"
"y and letter: z"

How can I achieve this with Python libraries? maybe using NLP techniques?


Solution

  • Here is a generic regex that works on your 3 cases:

    regex = r'this section includes[^:]*: (.*) (?:but|however it) doesn’t include: (.*)'
    
    df[['column_include', 'column_exclude']] = \
    df['column_description'].str.extract(regex)
    

    Output:

                                      column_description                         column_include           column_exclude
    0  this section includes: animals: cats and dogs ...  animals: cats and dogs and vegetables  plants and fruits: coco
    1  this section includes the following: axis: x a...                          axis: x and y               z, k and c
    2  this section includes notably: letters: a, b a...                    letters: a, b and c          y and letter: z
    

    regex demo

    If you want to make the second part optional, use a non-greedy quantifier (*?), an optional group ((?:...)) and an end of line anchor ($):

    regex = r'^this section\s*(?:includes[^:]*: (.*?))?\s*(?:(?:but|however it)? doesn’t include: (.*))?$'
    

    regex demo