I'm trying to prepare a training dataset for custom named entity recognition using spaCy. My data has a variable 'Text', which contains sentences, and a variable 'Names', which contains the names of people mentioned in those sentences. After going through some examples and spaCy's documentation, I realised that one has to pass the character indices of each entity when preparing the dataset. Is there any way to pass the entity as a string directly instead?
Reference: "https://medium.com/@manivannan_data/how-to-train-ner-with-custom-training-data-using-spacy-188e0e508c6"
No, spaCy needs the exact start and end character indices for your entity strings, since the string by itself cannot always be uniquely identified and resolved in the source text. Examples:
"Apple" is usually an ORG, but can be a PERSON.
"Ann" is a PERSON, but not in "Annotation tools are best for this purpose."
In Python, you can use the re module to grab the indices:
>>> import re
>>> [m.span() for m in re.finditer('Amazon', 'The Amazon is a river in South America. Amazon Inc is a company.')]
[(4, 10), (41, 47)]
You will have to go through and verify the indices before creating your spaCy training set.
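As a rough sketch of how this could be automated for the 'Text' and 'Names' columns from the question: the helper below is hypothetical (build_training_example is not part of spaCy), and it assumes the tuple-style format (text, {"entities": [(start, end, label)]}) used in the linked article. It matches on word boundaries so that "Ann" is not picked up inside "Annotation", and tries longer names first so the spans do not overlap. Names that begin or end with punctuation may not match with \b, so the output still needs a manual check.

```python
import re

def build_training_example(text, names, label="PERSON"):
    """Hypothetical helper: turn one sentence plus its name strings
    into (text, {"entities": [(start, end, label), ...]})."""
    entities = []
    # Try longer names first so "Ann Lee" wins over "Ann".
    for name in sorted(set(names), key=len, reverse=True):
        # re.escape guards against regex metacharacters in the name;
        # \b restricts matches to whole words.
        pattern = r"\b" + re.escape(name) + r"\b"
        for m in re.finditer(pattern, text):
            # Keep the span only if it does not overlap an accepted one.
            if all(m.end() <= s or m.start() >= e for s, e, _ in entities):
                entities.append((m.start(), m.end(), label))
    entities.sort()
    return (text, {"entities": entities})

text = "Ann Lee met Ann. Annotation tools are best for this purpose."
example = build_training_example(text, ["Ann", "Ann Lee"])
print(example[1]["entities"])
# [(0, 7, 'PERSON'), (12, 15, 'PERSON')]
```

This only finds the strings; it cannot tell whether a given "Apple" is an ORG or a PERSON, which is why the verification step above still applies.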