How to extract special characters using NLTK RegexpParser Chunk for POS_tagged words in Python


I have some text, for example: 80% of $300,000 Each Human Resource/IT Department.

I need to extract $300,000 along with the words Each Human Resource/IT Department.

I tokenized the text and POS-tagged the words. I was able to extract 300,000, but not the $ sign along with it.

What I have so far:

import nltk
from nltk.tokenize import PunktSentenceTokenizer

text = '80% of $300,000 Each Human Resource/IT Department'
train_text = text
sample_text = text
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

for i in tokenized:
    words = nltk.word_tokenize(i)
    tagged = nltk.pos_tag(words)

chunkGram = r"""chunk: {<DT>+<NN.*>+<NN.*>?|<NNP>?|<CD>+<NN>?|<NNP>?}"""

chunkParser = nltk.RegexpParser(chunkGram)
chunked = chunkParser.parse(tagged)

The chunked output, when converted to a list: ['80 %', '300,000', 'Each Human Resource/IT Department']

What I wanted: ['80 %', '$300,000', 'Each Human Resource/IT Department']

I tried

chunkGram = r"""chunk: {</$CD>|<DT>+<NN.*>+<NN.*>?|<NNP>?|<CD>+<NN>?|?}"""

It still doesn't work. All I need is to capture the $ together with the CD token.


Solution

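  • Add <\$>? in front of <CD>+ in your grammar. pos_tag gives the dollar sign the tag $, and because $ is a regex metacharacter it has to be escaped in the tag pattern: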

    chunkGram = r"""chunk: {<DT>+<NN.*>+<NN.*>?|<\$>?<CD>+<NN>?|<NNP>?}"""
    

    Code:

    import nltk
    from nltk.tokenize import PunktSentenceTokenizer

    text = '80% of $300,000 Each Human Resource/IT Department'
    train_text = text
    sample_text = text
    custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
    tokenized = custom_sent_tokenizer.tokenize(sample_text)

    # The sample is a single sentence, so `tagged` holds its tagged tokens
    # after the loop.
    for i in tokenized:
        words = nltk.word_tokenize(i)
        tagged = nltk.pos_tag(words)

    # <\$>? optionally matches the token tagged $ in front of the number.
    chunkGram = r"""chunk: {<DT>+<NN.*>+<NN.*>?|<\$>?<CD>+<NN>?|<NNP>?}"""

    chunkParser = nltk.RegexpParser(chunkGram)
    chunked = chunkParser.parse(tagged)

    print(chunked)


    Output:

    (S
      (chunk 80/CD %/NN)
      of/IN
      (chunk $/$ 300,000/CD)
      (chunk Each/DT Human/NNP Resource/IT/NNP Department/NNP))
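
The question also asks for the chunks as a list of strings. A minimal sketch of one way to get there, assuming the tags shown in the output above: the tagged tokens are typed in by hand so the snippet runs without tokenizer or tagger downloads, and `chunks_to_strings` is a hypothetical helper name, not part of NLTK. Gluing `$` back onto the number is a plain string replace, which may be too naive for other inputs.

```python
import nltk

# Hand-tagged tokens copied from the output above (an assumption for this
# sketch; in real use they come from nltk.pos_tag).
tagged = [
    ("80", "CD"), ("%", "NN"), ("of", "IN"),
    ("$", "$"), ("300,000", "CD"),
    ("Each", "DT"), ("Human", "NNP"), ("Resource/IT", "NNP"), ("Department", "NNP"),
]

chunk_gram = r"""chunk: {<DT>+<NN.*>+<NN.*>?|<\$>?<CD>+<NN>?|<NNP>?}"""
chunked = nltk.RegexpParser(chunk_gram).parse(tagged)


def chunks_to_strings(tree, label="chunk"):
    """Return the words of every subtree labelled `label`, one string per chunk."""
    phrases = []
    for subtree in tree.subtrees(filter=lambda t: t.label() == label):
        words = [word for word, tag in subtree.leaves()]
        # Join tokens with spaces, then attach "$" to the amount that follows it.
        phrases.append(" ".join(words).replace("$ ", "$"))
    return phrases


print(chunks_to_strings(chunked))
```

With the tags above this should give ['80 %', '$300,000', 'Each Human Resource/IT Department'], which is the list the question asks for.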