
How to tag named entities to prepare training data for custom named entity recognition with spacy?


I want to train spaCy's named entity recognizer on my custom dataset. I have prepared a Python dictionary where each key is an entity type and each value is a list of entity names, but I can't find a way to tag the tokens in the proper format.

I have tried plain string matching (find) and regular expressions (search, compile), but I'm not getting what I want.

For example, here are my sentence and the dict I'm using:

sentence = "Machine learning and data mining often employ the same methods and overlap significantly."

dic = {'MLDM': ['machine learning and data mining'], 'ML': ['machine learning'],
 'DM': ['data mining']}

lowered = sentence.lower()  # the dictionary values are lower-case
for k, v in dic.items():
  for val in v:
    if val in lowered:
      # right now I'm just printing the key, val and starting index
      print(k, val, lowered.index(val))

output:
MLDM machine learning and data mining 0
ML machine learning 0
DM data mining 21

expected output: MLDM 0 32

so that I can go on to prepare training data for spaCy NER:
[{"content":"machine learning and data mining often employ the same methods and overlap significantly.","entities":[[0,32,"MLDM"]]}]

Solution

  • You can build a regex from all the values in your dic to match them as whole words and, upon a match, look up the key associated with the matched value. I assume the values are unique across the dictionary, may contain whitespace, and consist only of "word" characters (no special characters such as "+" or "(").

    import re
    
    sentence = "Machine learning and data mining often employ the same methods and overlap significantly."
    
    dic = {'MLDM': ['machine learning and data mining'], 'ML': ['machine learning'],
     'DM': ['data mining']}
    
    def get_key(val):
        # Return the key whose value list contains val (compared case insensitively)
        for k, v in dic.items():
            if val.lower() in map(str.lower, v):
                return k
        return ''
    
    # Flatten the lists in values and sort the list by length in descending order
    l = sorted([v for x in dic.values() for v in x], key=len, reverse=True)
    # Build the alternation based regex with \b to match each item as a whole word
    # (re.escape is only a precaution in case a value contains a special character)
    rx = r'\b(?:{})\b'.format("|".join(map(re.escape, l)))
    for m in re.finditer(rx, sentence, re.I): # Search case insensitively
        key = get_key(m.group())
        if key:
            # Print the key with the start and end offsets of the match
            print("{} {} {}".format(key, m.start(), m.end()))  # MLDM 0 32
    

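To get from the printed offsets to the record layout shown in the question, the same regex idea can be wrapped in a small helper. This is only a sketch: build_training_example is a hypothetical name, not part of spaCy, and it assumes the {"content": ..., "entities": [[start, end, label]]} layout from the question and that each phrase appears under exactly one key.

```python
import re

def build_training_example(sentence, entity_dict):
    """Hypothetical helper: return {"content": ..., "entities": [[start, end, label], ...]}."""
    # Map each lower-cased phrase to its label (assumes phrases are unique across keys).
    label_of = {phrase.lower(): label
                for label, phrases in entity_dict.items()
                for phrase in phrases}
    if not label_of:
        return {"content": sentence, "entities": []}
    # Longest phrases first, so the alternation prefers the longest match.
    phrases = sorted(label_of, key=len, reverse=True)
    pattern = re.compile(
        r'\b(?:{})\b'.format("|".join(map(re.escape, phrases))), re.I)
    # Each match contributes its character offsets and the label of the matched phrase.
    entities = [[m.start(), m.end(), label_of[m.group().lower()]]
                for m in pattern.finditer(sentence)]
    return {"content": sentence, "entities": entities}

sentence = ("Machine learning and data mining often employ the same methods "
            "and overlap significantly.")
dic = {'MLDM': ['machine learning and data mining'],
       'ML': ['machine learning'],
       'DM': ['data mining']}

example = build_training_example(sentence, dic)
print(example["entities"])  # [[0, 32, 'MLDM']]
```

Because re.finditer returns non-overlapping matches and the longest phrase comes first in the alternation, the shorter "machine learning" and "data mining" entries are not reported inside the longer MLDM span.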