python nlp entity

How do I create a new entity and use it to find that entity in my test data? How do I make my tokenizer work?


I would like to define a new entity type, say "medicine", train a model for it on my corpora, and then identify every "medicine" entity in my test data. Somehow my code is not working; could anyone help me?

import nltk


def extract_entity_names(t):
    """Recursively collect the text of every 'NE' subtree in a chunk tree."""
    entity_names = []

    # Only Tree nodes have a label(); leaves are plain (word, tag) tuples.
    if hasattr(t, 'label'):
        if t.label() == 'NE':
            entity_names.append(' '.join(child[0] for child in t))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))

    return entity_names


test = input("Please enter your file name: ")
test1 = input("Please enter your second file name: ")

# Lines of the first file; every occurrence of 'value' is replaced below.
with open(test, "r") as file:
    new = file.read().splitlines()

# Lines of the second file; each one is substituted for 'value'.
with open(test1, "r") as file2:
    new1 = file2.read().splitlines()

for s in new:
    for x in new1:
        sample1 = s.replace('value', x)
        print(sample1)

        sentences = nltk.sent_tokenize(sample1)
        tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
        tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
        chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

        print(sentences)

        # Print whatever the pretrained chunker labeled as a named entity.
        for tree in chunked_sentences:
            print(extract_entity_names(tree))

Solution

  • How do I create a new entity and use it to find that entity in my test data?

    Named entity recognizers are trained models (probabilistic, neural, or linear). In your code, the line

    chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)
    

    performs the prediction, using NLTK's pretrained chunker. With binary=True it only marks spans as a generic NE, without any entity type. To recognize a new entity type, you first need to train a classifier on annotated data that contains that entity type.
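To make "annotated data" concrete: one common convention (an assumption here, the question does not specify a format) is to give each token a BIO tag, where B-MEDICINE starts an entity, I-MEDICINE continues it, and O marks everything else. The sketch below converts hand-marked character spans into such tags using only the standard library. The sentence, the span, the whitespace tokenization, and the helper name spans_to_bio are all illustrative.

```python
import re


def spans_to_bio(text, spans, label="MEDICINE"):
    """Turn (start, end) character spans into token-level BIO tags.

    Tokens are whitespace-separated chunks here, which is a simplification;
    a real pipeline would reuse the tokenizer of the model being trained.
    """
    tagged = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end in spans:
            # A token belongs to an entity if it lies fully inside the span.
            if match.start() >= start and match.end() <= end:
                tag = ("B-" if match.start() == start else "I-") + label
                break
        tagged.append((match.group(), tag))
    return tagged


# Hypothetical annotated sentence: the span covers "acetylsalicylic acid".
text = "She took acetylsalicylic acid daily"
entity = "acetylsalicylic acid"
start = text.index(entity)
spans = [(start, start + len(entity))]

bio = spans_to_bio(text, spans)
print(bio)
```

Token-level labels like these are what a sequence classifier would be trained on; the pretrained NLTK chunker has never seen a MEDICINE label.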

    "Somehow my code is not working"

    As noted above, you have not trained NLTK's model on your own data, so it cannot recognize "medicine" entities.

    "How do I make my tokenizer work?"

    The tokenizer only splits text into word tokens, which your code already does on this line:

    tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
    

    The tokenizer itself, however, does not predict named entities.

    If you want to train a model to predict a custom named entity such as "medicine" with NLTK, try this tutorial.
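As a rough idea of what such training involves, here is a minimal sketch that is not taken from the tutorial. It swaps in nltk.NaiveBayesClassifier as a simple per-token classifier over BIO-labeled tokens. The training sentences, the feature function token_features, and the helper tag_tokens are hypothetical, and three sentences are far too few for a useful model.

```python
import nltk

# Hypothetical hand-labeled training sentences in BIO format.
train_sents = [
    [("Take", "O"), ("aspirin", "B-MEDICINE"), ("daily", "O")],
    [("She", "O"), ("was", "O"), ("given", "O"), ("ibuprofen", "B-MEDICINE")],
    [("He", "O"), ("took", "O"), ("paracetamol", "B-MEDICINE"), ("today", "O")],
]


def token_features(tokens, i):
    """Simple, illustrative features for the token at position i."""
    word = tokens[i]
    return {
        "lower": word.lower(),
        "suffix3": word[-3:].lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
    }


# Build one (features, label) pair per token.
train_set = []
for sent in train_sents:
    words = [w for w, _ in sent]
    for i, (_, label) in enumerate(sent):
        train_set.append((token_features(words, i), label))

classifier = nltk.NaiveBayesClassifier.train(train_set)


def tag_tokens(tokens):
    """Predict a BIO label for each token independently."""
    return [(tok, classifier.classify(token_features(tokens, i)))
            for i, tok in enumerate(tokens)]


predicted = tag_tokens(["Take", "ibuprofen", "daily"])
print(predicted)
```

Classifying each token independently ignores the dependencies between neighboring labels, which is one reason dedicated sequence models tend to work better for NER.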

    In my experience, NLTK may not be well suited to this; take a look at spaCy.
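For a quick first look at spaCy, the sketch below uses its rule-based entity_ruler component rather than a trained model, so it only finds terms you list explicitly. It assumes spaCy v3 or later is installed. The MEDICINE label, the patterns, and the example sentence are made up for illustration.

```python
import spacy

# Blank English pipeline: tokenizer only, no pretrained model download needed.
nlp = spacy.blank("en")

# Rule-based matcher component; it is not a trained model.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Token patterns on LOWER match case-insensitively.
    {"label": "MEDICINE", "pattern": [{"LOWER": "ibuprofen"}]},
    {"label": "MEDICINE", "pattern": [{"LOWER": "acetylsalicylic"}, {"LOWER": "acid"}]},
])

doc = nlp("The patient took Ibuprofen and acetylsalicylic acid.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

A term list like this can also serve as a way to bootstrap annotations before training a statistical NER component on them.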