I have some text, for example: 80% of $300,000 Each Human Resource/IT Department.
I need to extract $300,000 along with the words Each Human Resource/IT Department.
I used POS tagging on the words after tokenizing. I was able to extract 300,000, but not the $ sign along with it.
What I have so far:
import nltk
from nltk.tokenize import PunktSentenceTokenizer

text = '80% of $300,000 Each Human Resource/IT Department'
train_text = text
sample_text = text
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
for i in tokenized:
    words = nltk.word_tokenize(i)
    tagged = nltk.pos_tag(words)
    chunkGram = r"""chunk: {<DT>+<NN.*>+<NN.*>?|<NNP>?|<CD>+<NN>?|<NNP>?}"""
    chunkParser = nltk.RegexpParser(chunkGram)
    chunked = chunkParser.parse(tagged)
Chunked output when converted to a list: ['80 %', '300,000', 'Each Human Resource/IT Department']
What I wanted: ['80 %', '$300,000', 'Each Human Resource/IT Department']
I tried:
chunkGram = r"""chunk: {</$CD>|<DT>+<NN.*>+<NN.*>?|<NNP>?|<CD>+<NN>?|<NNP>?}"""
It still doesn't work. All I need is the $ chunked together with the CD.
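You need to add <\$>? to your grammar. The tokenizer splits $ off as its own token and the tagger gives it the tag $, so the chunk pattern has to match that tag explicitly. Because $ is a regex metacharacter inside tag patterns, it must be escaped with a backslash.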
chunkGram = r"""chunk: {<DT>+<NN.*>+<NN.*>?|<\$>?<CD>+<NN>?|<NNP>?}"""
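A quick way to see why the backslash matters is to run both spellings on hand-tagged tokens. This is only a minimal sketch: the `tagged` list is written by hand to mirror the `$/$` and `300,000/CD` tags shown in the output, and `chunks` is a hypothetical helper name introduced for illustration.

```python
import nltk

# Hand-tagged tokens (an assumption standing in for nltk.pos_tag output).
tagged = [("$", "$"), ("300,000", "CD")]


def chunks(grammar):
    """Return the words of every 'chunk' subtree produced by the grammar."""
    tree = nltk.RegexpParser(grammar).parse(tagged)
    return [
        [word for word, _tag in subtree.leaves()]
        for subtree in tree.subtrees(lambda t: t.label() == "chunk")
    ]


# Unescaped: "$" is read as the regex end-of-string anchor, so nothing matches.
unescaped = chunks(r"chunk: {<$><CD>}")

# Escaped: "\$" matches the literal "$" tag.
escaped = chunks(r"chunk: {<\$><CD>}")

print(unescaped)
print(escaped)
```

With the unescaped pattern no chunk is produced, while the escaped pattern groups the $ and the number together.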
Code:
import nltk
from nltk.tokenize import PunktSentenceTokenizer

text = '80% of $300,000 Each Human Resource/IT Department'
train_text = text
sample_text = text
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
for i in tokenized:
    words = nltk.word_tokenize(i)
    tagged = nltk.pos_tag(words)
    chunkGram = r"""chunk: {<DT>+<NN.*>+<NN.*>?|<\$>?<CD>+<NN>?|<NNP>?}"""
    chunkParser = nltk.RegexpParser(chunkGram)
    chunked = chunkParser.parse(tagged)
    print(chunked)
Output:
(S
  (chunk 80/CD %/NN)
  of/IN
  (chunk $/$ 300,000/CD)
  (chunk Each/DT Human/NNP Resource/IT/NNP Department/NNP))
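To turn a tree like this into the list the question asks for, one option is to walk the chunk subtrees and join their words, gluing the $ back onto the number. The following is a sketch under stated assumptions: the tokens are tagged by hand to match the output above (so no tagger model is needed), the grammar is a simplified two-rule variant rather than the exact one in the answer, and `chunk_text` is a hypothetical helper.

```python
import nltk

# Hand-tagged tokens mirroring the tags in the output above (an assumption;
# in the real pipeline these come from nltk.pos_tag).
tagged = [
    ("80", "CD"), ("%", "NN"), ("of", "IN"),
    ("$", "$"), ("300,000", "CD"),
    ("Each", "DT"), ("Human", "NNP"), ("Resource/IT", "NNP"), ("Department", "NNP"),
]

# Simplified grammar for illustration: one rule per line under the same label.
grammar = r"""
chunk:
    {<\$>?<CD>+<NN>?}
    {<DT>+<NN.*>+}
"""


def chunk_text(subtree):
    """Join the words of a chunk, attaching '$' to the token that follows it."""
    text = " ".join(word for word, _tag in subtree.leaves())
    return text.replace("$ ", "$")


tree = nltk.RegexpParser(grammar).parse(tagged)
result = [chunk_text(st) for st in tree.subtrees(lambda t: t.label() == "chunk")]
print(result)
```

The `replace("$ ", "$")` step is needed because the tokenizer emits $ and 300,000 as separate tokens, so a plain join would leave a space between them.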