I have a dataframe with a text column in this form:
column_description
"this section includes: animals: cats and dogs and vegetables but doesn’t include: plants and fruits: coco"
"this section includes the following: axis: x and y but doesn’t include: z, k and c"
"this section includes notably: letters: a, b and c however it doesn’t include: y and letter: z"
I want to separate the text within the column and get two new columns like the following:
column_include
"animals: cats and dogs and vegetables"
"axis: x and y"
"letters: a, b and c "
column_exclude
"plants and fruits: coco"
"z, k and c"
"y and letter: z"
How can I achieve this with Python libraries? maybe using NLP techniques?
Here is a generic regex that works on your 3 cases:
regex = r'this section includes[^:]*: (.*) (?:but|however it) doesn’t include: (.*)'
df[['column_include', 'column_exclude']] = \
df['column_description'].str.extract(regex)
Output:
column_description column_include column_exclude
0 this section includes: animals: cats and dogs ... animals: cats and dogs and vegetables plants and fruits: coco
1 this section includes the following: axis: x a... axis: x and y z, k and c
2 this section includes notably: letters: a, b a... letters: a, b and c y and letter: z
If you want to make the second part optional, use a non-greedy quantifier (*?
), an optional group ((?:...)
) and an end of line anchor ($
):
regex = r'^this section\s*(?:includes[^:]*: (.*?))?\s*(?:(?:but|however it)? doesn’t include: (.*))?$'