
How to get population/country information from text using the NLTK package


I have text that contains information about a population as well as a country. I would like to get the named entities (NER) for both the population and the country.

My text is as follows:

text_sent = "antigens in arterial occlusive diseases in japan.using a nih standard lymphocytotoxicity test, a possible japanese specific antigen, bjw 22.2 was identified in 17 out of 48 patients with thromboangiitis obliterans (35.4 per cent), in 5 out of 15 patients with takayasu's arteritis (33.3 per cent) and in 11 out of 113 normal controls (9.7 per cent)."

I tried this:

from nltk import word_tokenize, pos_tag, ne_chunk
ne_chunk(pos_tag(word_tokenize(text_sent)))

I got the tagging, but no word was tagged as GPE.

(S antigens/NNS in/IN arterial/JJ occlusive/JJ diseases/NNS in/IN japan.using/VBG a/DT nih/JJ standard/JJ lymphocytotoxicity/NN test/NN ,/, a/DT possible/JJ japanese/JJ specific/JJ antigen/NN ,/, bjw/JJ 22.2/CD was/VBD identified/VBN in/IN 17/CD out/IN of/IN 48/CD patients/NNS with/IN thromboangiitis/NN obliterans/NNS (/( 35.4/CD per/IN cent/NN )/) ,/, in/IN 5/CD out/IN of/IN 15/CD patients/NNS with/IN takayasu/NN 's/POS arteritis/NN (/( 33.3/CD per/IN cent/NN )/) and/CC in/IN 11/CD out/IN of/IN 113/CD normal/JJ controls/NNS (/( 9.7/CD per/IN cent/NN )/) ./.)


Solution

  • You are not getting a GPE tag because "japan.using" is treated as a single token, and that token is not the name of a geographical location. It should be "Japan. using" instead.

    I tried this with a pretrained spaCy model:

    import spacy 
    nlp = spacy.load("en_core_web_sm")
    
    doc = nlp(u"antigens in arterial occlusive diseases in japan.using a nih standard lymphocytotoxicity test, a possible japanese specific antigen, bjw 22.2 was identified in 17 out of 48 patients with thromboangiitis obliterans (35.4 per cent), in 5 out of 15 patients with takayasu's arteritis (33.3 per cent) and in 11 out of 113 normal controls (9.7 per cent).")
    
    for ent in doc.ents:
        print(ent.text, ent.start_char, ent.end_char, ent.label_)
    
    # Output
    japanese 106 114 NORP
    22.2 137 141 CARDINAL
    17 160 162 CARDINAL
    48 170 172 CARDINAL
    35.4 per cent 215 228 MONEY
    5 234 235 CARDINAL
    15 243 245 CARDINAL
    33.3 per cent 282 295 MONEY
    11 304 306 CARDINAL
    113 314 317 CARDINAL
    9.7 per cent 335 347 MONEY
    

    But when you replace 'japan.using' with 'Japan. using', you get the GPE tag:

    Japan 43 48 GPE
    japanese 107 115 NORP
    22.2 138 142 CARDINAL
    17 161 163 CARDINAL
    48 171 173 CARDINAL
    35.4 per cent 216 229 MONEY
    5 235 236 CARDINAL
    15 244 246 CARDINAL
    33.3 per cent 283 296 MONEY
    11 305 307 CARDINAL
    113 315 318 CARDINAL
    9.7 per cent 336 348 MONEY
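
    The manual fix above can be automated before the text is passed to NLTK or spaCy. What follows is a minimal sketch, not part of the original answer: the helper names split_fused_sentences and capitalize_names are hypothetical, and the list of names to capitalize is something you would have to supply yourself. It assumes that a period squeezed between two letters marks a missing sentence break.

```python
import re

# Matches a period squeezed between two letters, as in "japan.using".
# Digits are excluded so that decimals such as "22.2" are left alone.
_FUSED_PERIOD = re.compile(r"(?<=[A-Za-z])\.(?=[A-Za-z])")


def split_fused_sentences(text):
    """Insert the missing space after a period that sits between two letters."""
    return _FUSED_PERIOD.sub(". ", text)


def capitalize_names(text, names):
    """Capitalize whole-word occurrences of the given names.

    Hypothetical helper: `names` is a list you provide (an assumption),
    for example country names you expect to see in lowercased text.
    """
    for name in names:
        pattern = r"\b%s\b" % re.escape(name)
        text = re.sub(pattern, name.capitalize(), text, flags=re.IGNORECASE)
    return text


raw = ("antigens in arterial occlusive diseases in japan.using a nih "
       "standard test, bjw 22.2 was identified.")
cleaned = capitalize_names(split_fused_sentences(raw), ["japan"])
print(cleaned)
```

    The cleaned string can then be passed to nlp(...) or word_tokenize(...) as before. Note that this simple pattern would also split abbreviations such as "e.g." and dotted names such as URLs, so it is only suitable when the text does not contain them.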