Tags: python-3.x, nlp, spacy, named-entity-recognition

How to prepare data for spaCy's custom named entity recognition?


I'm trying to prepare a training dataset for custom named entity recognition using spaCy. My data has a variable 'Text', which contains sentences, and a variable 'Names', which holds the names of people mentioned in those sentences. After going through some examples and spaCy's documentation, I realised that one has to pass the character indices of each entity when preparing the dataset. Is there any way to pass the entity as a string directly instead?

Reference: "https://medium.com/@manivannan_data/how-to-train-ner-with-custom-training-data-using-spacy-188e0e508c6"


Solution

  • No. spaCy needs exact start and end character indices for each entity, because a string on its own cannot always be identified and resolved unambiguously in the source text. Examples:

    • Apple is usually an ORG, but can be a PERSON.
    • Ann is a PERSON, but not in "Annotation tools are best for this purpose."

    In Python, you can use the re module to find the indices:

    >>> import re
    >>> [m.span() for m in re.finditer('Amazon', 'The Amazon is a river in South America.  Amazon Inc is a company.')]
    [(4, 10), (41, 47)]
    
    

    You will have to go through and verify the indices before creating your spaCy training set.
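
As a rough sketch of how the 'Text' and 'Names' columns from the question could be turned into index-based annotations, the helper below wraps re.finditer. The function name build_training_example, the default label "PERSON", and the sample sentence are illustrative assumptions, not part of spaCy. The (text, {"entities": [(start, end, label)]}) tuple shape follows the format used in the article linked in the question, so check it against the spaCy version you are training with.

```python
import re


def build_training_example(text, names, label="PERSON"):
    """Hypothetical helper: locate each name in text and return
    (text, {"entities": [(start, end, label), ...]})."""
    entities = []
    for name in names:
        # \b keeps "Ann" from matching inside "Annotation"; this assumes
        # the name starts and ends with a word character.
        # re.escape protects names that contain regex metacharacters.
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            entities.append((match.start(), match.end(), label))
    # Sort by start offset so the spans appear in reading order.
    return (text, {"entities": sorted(entities)})


example = build_training_example(
    "Ann met Bob. Annotation tools are best for this purpose.",
    ["Ann", "Bob"],
)
print(example)
```

Slicing the text with each (start, end) pair should give back the name, which is a quick way to verify the indices. Word boundaries do not settle ambiguity of meaning (Amazon the river versus Amazon the company), and names that overlap each other (for instance "Ann" and "Ann Smith") would still need to be resolved by hand.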