split dataframe text column with given strings - NLP

I have a dataframe with a text column in this form:

column_description

"this section includes: animals: cats and dogs and vegetables but doesn’t include: plants and fruits: coco"
"this section includes the following: axis: x and y but doesn’t include: z, k and c"
"this section includes notably: letters: a, b and c however it doesn’t include: y and letter: z"

I want to separate the text within the column and get two new columns like the following:

column_include

"animals: cats and dogs and vegetables"
"axis: x and y"
"letters: a, b and c "

column_exclude

"plants and fruits: coco"
"z, k and c"
"y and letter: z"

How can I achieve this with Python libraries? maybe using NLP techniques?

Solution

Here is a generic regex that works on your 3 cases:

regex = r'this section includes[^:]*: (.*) (?:but|however it) doesn’t include: (.*)'

df[['column_include', 'column_exclude']] = \
df['column_description'].str.extract(regex)

Output:

                                  column_description                         column_include           column_exclude
0  this section includes: animals: cats and dogs ...  animals: cats and dogs and vegetables  plants and fruits: coco
1  this section includes the following: axis: x a...                          axis: x and y               z, k and c
2  this section includes notably: letters: a, b a...                    letters: a, b and c          y and letter: z

regex demo

If you want to make the second part optional, use a non-greedy quantifier (*?), an optional group ((?:...)) and an end of line anchor ($):

regex = r'^this section\s*(?:includes[^:]*: (.*?))?\s*(?:(?:but|however it)? doesn’t include: (.*))?$'

regex demo