python python-3.x tuples spacy named-entity-recognition

Merge tuples in a list - spaCy training set related

I have n tuples (10K or more) in a list like the following (spaCy's training format) -

[
('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]})
]

The plan is to group identical sentences and merge their dictionaries. I went for a brute-force nested loop first, but that is very slow with 10-25K examples. Is there a better way to do this?

Desired output -

[('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG'), (36, 42, 'ECOM-ORG')]})]

Solution

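Use the fact that strings in Python are hashable, so they can be used as dictionary keys.

Here I am using a dictionary keyed by the sentence, i.e. the first element of each tuple. Dictionary lookups take constant time on average, so the whole list is merged in a single pass instead of comparing every sentence against every other one.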

If you have memory limitations, you can process the data in batches or run it on a hosted notebook service such as Google Colab.

temp = [
('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]})
]
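data = {}
for text, annotations in temp:
    if text in data:
        # extend() keeps every entity of this tuple, not only the first one
        data[text]['entities'].extend(annotations['entities'])
    else:
        # copy the list so the dictionaries in the input are not mutated
        data[text] = {'entities': list(annotations['entities'])}
# dicts preserve insertion order (Python 3.7+), so sentences stay in first-seen order
temp = list(data.items())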
print(temp)

[('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG'), (36, 42, 'ECOM-ORG')]})]
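
If the same span can appear more than once, or a sentence can carry several entities per tuple, the same dictionary idea can be wrapped in a small function. The following is only a sketch under those assumptions: `merge_annotations` is a hypothetical helper name, not part of spaCy, and dropping exact duplicate spans and sorting them by start offset are choices added here, not something the question requires.

```python
from collections import defaultdict

def merge_annotations(examples):
    """Group (text, {'entities': [...]}) tuples by text and merge their spans."""
    merged = defaultdict(list)  # text -> list of (start, end, label) tuples
    for text, annotations in examples:
        spans = merged[text]  # creates the entry even when there are no entities
        for span in annotations.get('entities', []):
            span = tuple(span)
            if span not in spans:  # assumption: exact duplicate spans are dropped
                spans.append(span)
    # Sort spans by start offset so the result does not depend on input order.
    return [(text, {'entities': sorted(spans)}) for text, spans in merged.items()]

train = [
    ('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]}),
    ('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
    ('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
]
print(merge_annotations(train))
```

Because the function accepts any iterable, it can also be fed a generator that reads the data in chunks if the full list does not fit in memory comfortably.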