python regex nltk chunking

Chunking sentences using the word 'but' with RegEx


I am trying to chunk sentences with RegEx at the word 'but' (or any other coordinating conjunction), but it isn't working:

import nltk
from nltk import word_tokenize

sentence = nltk.pos_tag(word_tokenize("There are no large collections present but there is spinal canal stenosis."))
# Attempted grammar: chunk everything around a coordinating conjunction (CC)
DigDug = nltk.RegexpParser(r'CHUNK: {.*<CC>.*}')
for subtree in DigDug.parse(sentence).subtrees():
    if subtree.label() == 'CHUNK':
        print(subtree.leaves())

I need to split the sentence "There are no large collections present but there is spinal canal stenosis." into two:

1. "There are no large collections present"
2. "there is spinal canal stenosis."

I would also like to use the same code to split sentences at 'and' and other coordinating conjunctions (CC). But my code isn't working. Please help.


Solution

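  • I think you can simply split the raw string (before tokenizing and tagging it) with re.split:

    import re
    text = "There are no large collections present but there is spinal canal stenosis."
    result = re.split(r"\s+(?:but|and)\s+", text)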
    

    where

    `\s`        Match a single whitespace character (space, tab, line break, etc.)
    `+`         One or more times, as many as possible, giving back as needed (greedy)
    `(?:`       Start a non-capturing group containing the alternatives below
      `but`     Match the characters "but" literally
      `|`       Or, if that alternative fails, try the next one
      `and`     Match the characters "and" literally
    `)`         End of the non-capturing group
    `\s`        Match a single whitespace character (space, tab, line break, etc.)
    `+`         One or more times, as many as possible, giving back as needed (greedy)
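
To see what the pattern does and does not match, here is a small sketch run on the sentence from the question. The names `CONJ_SPLIT`, `text`, `parts` and `unsplit` are only illustrative, and the input is assumed to be a plain string rather than the POS-tagged list.

```python
import re

# Same pattern as above: "but" or "and" with whitespace on both sides.
CONJ_SPLIT = re.compile(r"\s+(?:but|and)\s+")

text = "There are no large collections present but there is spinal canal stenosis."
parts = CONJ_SPLIT.split(text)
print(parts)

# "and" inside "Sandy" and "but" inside "butted" are not surrounded by
# whitespace on both sides, so nothing is split here.
unsplit = CONJ_SPLIT.split("Sandy butted in")
print(unsplit)
```

Because the surrounding whitespace is part of the match, the returned pieces come back without the spaces that sat next to the conjunction.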
    

    You can add more conjunctions, separated by the pipe character `|`. Take care that these words do not contain characters with special meaning in regex; if in doubt, escape them first with `re.escape(word)`.
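
One way to follow that advice is to build the pattern from a list of words, escaping each one. This is a rough sketch: `build_splitter` is a hypothetical helper, the word list is just an example, and the `re.IGNORECASE` flag is an extra assumption that was not part of the pattern above.

```python
import re

def build_splitter(words):
    """Compile a pattern that splits on any of `words` surrounded by whitespace."""
    # re.escape neutralises regex metacharacters such as "." or "+".
    alternatives = "|".join(re.escape(w) for w in words)
    # IGNORECASE is an assumption here, so that "BUT" and "But" also split.
    return re.compile(r"\s+(?:" + alternatives + r")\s+", flags=re.IGNORECASE)

# Illustrative word list; extend as needed.
splitter = build_splitter(["but", "and", "or", "yet"])
clauses = splitter.split("The scan is clear BUT the pain persists and follow-up is advised.")
print(clauses)
```

Note that this still splits on every listed word wherever it appears between spaces, including an "and" that joins two nouns rather than two clauses, so the word list may need tuning for your text.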