python, regex, nlp, punctuation, text-segmentation

text segmentation based on punctuation marks, especially at clause level


I want to segment text whenever we encounter a punctuation mark in a sentence or paragraph. If I include the comma (,) in my regex, it also chunks individual nouns, verbs, or adjectives that are separated by commas. Suppose we have "dogs, cats, rats and other animals": "dogs" becomes a separate chunk, which I do not want to happen. Is there any way I can avoid that using regex, or any other means in NLTK, so that I only get comma-separated clauses as text segments?

Code

import re

text = ("Peter Mattei's 'Love in the Time of Money' is a visually stunning "
        "film to watch. Mrs. Mattei offers us a vivid portrait about human "
        "relations. This is a movie that seems to be telling us what money, "
        "power and success do to people in the different situation we "
        "encounter.")
# Protect the period after titles so it is not treated as a sentence boundary
# (each lookbehind alternative is padded to the same width with dots).
text = re.sub(r"(?<=..Dr|.Mrs|..Mr|..Ms|Prof)[.]", "<prd>", text)
# Split on sentence- and clause-final punctuation.
txt = re.split(r'\.\s|;|:|\?|\'\s|"\s|!|\s\'|\s\"', text)
print(txt)
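For illustration, here is a minimal reproduction of the over-splitting described in the question (the sample sentence below is made up): as soon as the comma is added to the split pattern, each item of an enumeration becomes its own segment.

```python
import re

# Illustrative sentence: contains both a noun list and a real clause boundary.
text = "Dogs, cats, rats and other animals live here, but some are pets."

# Splitting on commas and periods chunks the list items individually.
segments = [s for s in re.split(r",\s*|\.\s*", text) if s]
print(segments)
# "Dogs" and "cats" come out as separate one-word segments.
```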

Solution

  • This is too complicated to solve with a regex: the regex has no way of knowing whether a clause candidate contains a predicate (a verb), and if you extend the pattern to account for that, you end up breaking into another clause.

    The problem you are trying to solve is called chunking in NLP. Traditionally, there were regex-based algorithms operating on POS tags (so you need to do POS tagging first). NLTK has a tutorial for that; however, this is a rather outdated approach.

    Now that fast and reliable taggers and parsers are available (e.g., in spaCy), I would suggest parsing the sentence first and then finding the chunks in a constituency parse.
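    The tag-based chunking idea mentioned above can be sketched with NLTK's `RegexpParser`. The grammar below and the hard-coded POS tags are illustrative assumptions (in practice you would obtain the tags from `nltk.pos_tag`); the point is that the chunk rule keeps a comma-separated noun list together as one chunk instead of splitting "dogs" off on its own.

    ```python
    from nltk import RegexpParser

    # Illustrative grammar: a noun/adjective run, optionally continued by
    # comma-separated items and a final "and ..." item, forms ONE chunk.
    grammar = r"NP_LIST: {<JJ|NN.*>+(<,><JJ|NN.*>+)*(<CC><JJ|NN.*>+)?}"
    parser = RegexpParser(grammar)

    # Hand-written tags (normally produced by nltk.pos_tag), hard-coded
    # here so the sketch runs without downloading tagger models.
    tagged = [("dogs", "NNS"), (",", ","), ("cats", "NNS"), (",", ","),
              ("rats", "NNS"), ("and", "CC"), ("other", "JJ"),
              ("animals", "NNS")]

    tree = parser.parse(tagged)
    print(tree)
    # The whole enumeration ends up inside a single NP_LIST subtree.
    ```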