Tags: python, string, nlp, nltk, spacy

Deleting sentences and updating entity indexes in a text document for NER training data


I am trying to create a training dataset for NER. For that I have a huge amount of data that needs to be tagged, with the unnecessary sentences removed. When an unnecessary sentence is removed, the index positions of the remaining annotations must be updated. Yesterday I saw some incredible code segments from other users about this, which I cannot find now. Adapting their code segments, I can describe my issue.

Let's take some sample training data:

data = [{"content":'''Hello we are hans and john. I enjoy playing Football.
I love eating grapes. Hanaan is great.''',"annotations":[{"id":1,"start":13,"end":17,"tag":"name"},
                                {"id":2,"start":22,"end":26,"tag":"name"},
                                {"id":3,"start":68,"end":74,"tag":"fruit"},
                                {"id":4,"start":76,"end":82,"tag":"name"}]}]

This can be visualized using the following spaCy displacy code:

import spacy
from spacy import displacy

data = [{"content":'''Hello we are hans and john. I enjoy playing Football.
I love eating grapes. Hanaan is great.''',"annotations":[{"id":1,"start":13,"end":17,"tag":"name"},
                                {"id":2,"start":22,"end":26,"tag":"name"},
                                {"id":3,"start":68,"end":74,"tag":"fruit"},
                                {"id":4,"start":76,"end":82,"tag":"name"}]}]

data_index = 0  # which record to visualize

# Convert the annotations into (start, end, tag) tuples
annot_tags = data[data_index]["annotations"]
entities = []
for j in annot_tags:
    start = j["start"]
    end = j["end"]
    tag = j["tag"]
    entities.append((start, end, tag))
data_one = [(data[data_index]["content"], {"entities": entities})]

nlp = spacy.blank('en')
raw_text = data_one[0][0]
doc = nlp.make_doc(raw_text)
spans = data_one[0][1]["entities"]
ents = []
for span_start, span_end, label in spans:
    ent = doc.char_span(span_start, span_end, label=label)
    if ent is None:  # skip offsets that don't align to token boundaries
        continue
    ents.append(ent)

doc.ents = ents
displacy.render(doc, style="ent", jupyter=True)

The output will be:

[Output 1: displacy rendering with hans, john and Hanaan tagged as name and grapes as fruit]
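Note that doc.char_span returns None whenever the character offsets do not line up with token boundaries, which is why the loop above silently skips such spans. In spaCy 3 you can instead pass alignment_mode="expand" to snap a near-miss onto the covering tokens; a small sketch, reusing the doc from above:

# Hypothetical offsets 13..16 stop one character short of "hans".
# The default alignment_mode="strict" would yield None; "expand"
# widens the span to the tokens it partially covers.
ent = doc.char_span(13, 16, label="name", alignment_mode="expand")
print(ent)  # hans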

Now I want to remove the sentences that are not tagged and update the index values, so the required output looks like:

[Required Output: displacy rendering without the untagged sentence]

The data must also be in the following format: the untagged sentence is removed and the index values are updated so that I get the output above. (Removing " I enjoy playing Football." deletes 26 characters, so every later annotation shifts left by 26: grapes moves from 68 to 42 and Hanaan from 76 to 50.)

Required output data:

[{"content":'''Hello we are hans and john.
I love eating grapes. Hanaan is great.''',"annotations":[{"id":1,"start":13,"end":17,"tag":"name"},
                                {"id":2,"start":22,"end":26,"tag":"name"},
                                {"id":3,"start":42,"end":48,"tag":"fruit"},
                                {"id":4,"start":50,"end":56,"tag":"name"}]}]

I was following a post yesterday and got some nearly working code.

Code

import re

data = [{"content":'''Hello we are hans and john. I enjoy playing Football.
I love eating grapes. Hanaan is great.''',"annotations":[{"id":1,"start":13,"end":17,"tag":"name"},
                                {"id":2,"start":22,"end":26,"tag":"name"},
                                {"id":3,"start":68,"end":74,"tag":"fruit"},
                                {"id":4,"start":76,"end":82,"tag":"name"}]}]

# Attach the annotated surface form to each annotation
for idx, each in enumerate(data[0]['annotations']):
    start = each['start']
    end = each['end']
    word = data[0]['content'][start:end]
    data[0]['annotations'][idx]['word'] = word

# Naive sentence split on '.'
sentences = [{'sentence': x.strip() + '.', 'checked': False} for x in data[0]['content'].split('.')]

new_data = [{'content': '', 'annotations': []}]
for idx, each in enumerate(data[0]['annotations']):
    for idx_alpha, sentence in enumerate(sentences):
        if sentence['checked']:
            continue
        temp = each.copy()
        check_word = temp['word']
        if check_word in sentence['sentence']:
            # Re-base the annotation onto the rebuilt content
            start_idx = re.search(r'\b({})\b'.format(check_word), sentence['sentence']).start()
            end_idx = start_idx + len(check_word)

            current_len = len(new_data[0]['content'])

            new_data[0]['content'] += sentence['sentence'] + ' '
            temp.update({'start': start_idx + current_len, 'end': end_idx + current_len})
            new_data[0]['annotations'].append(temp)

            # The sentence is marked as used after its first matching
            # annotation, so any later annotation in the same sentence
            # is never re-indexed
            sentences[idx_alpha]['checked'] = True
            break
print(new_data)

Output

[{'content': 'Hello we are hans and john. I love eating grapes. Hanaan is great. ',
  'annotations': [{'id': 1,
    'start': 13,
    'end': 17,
    'tag': 'name',
    'word': 'hans'},
   {'id': 3, 'start': 42, 'end': 48, 'tag': 'fruit', 'word': 'grapes'},
   {'id': 4, 'start': 50, 'end': 56, 'tag': 'name', 'word': 'Hanaan'}]}]

Here the name john is lost: once hans matches, the sentence is marked as checked and the inner loop breaks, so the second annotation in the same sentence is never carried over. If more than one tag is present in a sentence, I can't afford to lose any of them.


Solution

  • It's a pretty complicated task, in that you need to identify sentences, and a simple split on '.' may not work since it will also split on things like 'Mr.', etc.

    Since you are using spaCy, why not let it identify the sentences, then iterate through those, calculate the new start/end indexes, and skip any sentence that doesn't contain an entity. Then reconstruct the content.

    import re
    import spacy

    data = [{"content":'''Hello we are hans and john. I enjoy playing Football. \
    I love eating grapes. Hanaan is great. Mr. Jones is nice.''',"annotations":[{"id":1,"start":13,"end":17,"tag":"name"},
                                    {"id":2,"start":22,"end":26,"tag":"name"},
                                    {"id":3,"start":68,"end":74,"tag":"fruit"},
                                    {"id":4,"start":76,"end":82,"tag":"name"},
                                    {"id":5,"start":93,"end":102,"tag":"name"}]}]

    # Attach the annotated surface form to each annotation
    for idx, each in enumerate(data[0]['annotations']):
        start = each['start']
        end = each['end']
        word = data[0]['content'][start:end]
        data[0]['annotations'][idx]['word'] = word

    text = data[0]['content']

    nlp = spacy.load('en_core_web_sm')
    nlp.add_pipe('sentencizer')  # rule-based sentence segmentation

    doc = nlp(text)
    sentences = list(doc.sents)
    annotations = data[0]['annotations']

    new_data = [{'content': '', 'annotations': []}]
    for sentence in sentences:
        idx_to_remove = []
        for idx, annotation in enumerate(annotations):
            if annotation['word'] in sentence.text:
                temp = annotation.copy()

                # Position of the word within this sentence, then shifted
                # by the length of the content rebuilt so far
                start_idx = re.search(r'\b({})\b'.format(annotation['word']), sentence.text).start()
                end_idx = start_idx + len(annotation['word'])

                current_len = len(new_data[0]['content'])

                temp.update({'start': start_idx + current_len, 'end': end_idx + current_len})
                new_data[0]['annotations'].append(temp)

                idx_to_remove.append(idx)

        # Keep the sentence only if it contained at least one entity
        if len(idx_to_remove) > 0:
            new_data[0]['content'] += sentence.text + ' '
        # Drop the consumed annotations; they sit at the front of the
        # list because both lists are in document order
        for x in range(0, len(idx_to_remove)):
            del annotations[0]

    Output:

    print(new_data)
    [{'content': 'Hello we are hans and john. I love eating grapes. Hanaan is great. Mr. Jones is nice. ', 
    'annotations': [
    {'id': 1, 'start': 13, 'end': 17, 'tag': 'name', 'word': 'hans'}, 
    {'id': 2, 'start': 22, 'end': 26, 'tag': 'name', 'word': 'john'}, 
    {'id': 3, 'start': 42, 'end': 48, 'tag': 'fruit', 'word': 'grapes'}, 
    {'id': 4, 'start': 50, 'end': 56, 'tag': 'name', 'word': 'Hanaan'}, 
    {'id': 5, 'start': 67, 'end': 76, 'tag': 'name', 'word': 'Mr. Jones'}]}]
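
    Two caveats worth noting about this approach, plus a possible hardening (my own sketch, not part of the answer above): the word is interpolated into a regex unescaped, so a '.' in a value like 'Mr. Jones' acts as a metacharacter, and the substring test annotation['word'] in sentence.text can match the wrong sentence. Matching on the original character offsets avoids both, since every spaCy sentence span carries its start_char/end_char within the document. filter_tagged_sentences below is a hypothetical helper name:

    import spacy

    def filter_tagged_sentences(record, nlp):
        """Keep only sentences containing at least one annotation,
        re-basing each annotation onto the rebuilt content."""
        doc = nlp(record["content"])
        new_content = ""
        new_annotations = []
        for sent in doc.sents:
            # Annotations whose character range falls inside this sentence
            hits = [a for a in record["annotations"]
                    if a["start"] >= sent.start_char and a["end"] <= sent.end_char]
            if not hits:
                continue  # drop untagged sentences
            # Shift from document offsets to rebuilt-content offsets
            offset = len(new_content) - sent.start_char
            for a in hits:
                new_annotations.append({**a, "start": a["start"] + offset,
                                        "end": a["end"] + offset})
            new_content += sent.text + " "
        return [{"content": new_content, "annotations": new_annotations}]

    # Note: run this on a freshly built `data`, since the loop above
    # consumes data[0]['annotations'] in place.
    nlp = spacy.load('en_core_web_sm')
    print(filter_tagged_sentences(data[0], nlp))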