I have n (10K or more) tuples in a list like the one below (spaCy's training format) -
[
('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]})
]
The plan is to group identical sentences and merge their dictionaries. I obviously went for the brute-force looping idea, but that is very slow with 10-25K items. Are there better/optimal ways to do this?
Desired output -
[('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG'), (36, 42, 'ECOM-ORG')]})]
Use the fact that str in Python is hashable, so it can serve as a dictionary key.
Here I am using a dictionary whose key is the string, i.e. the first element of your tuple. Each lookup is then O(1), so the whole merge is a single O(n) pass instead of a nested loop.
If you have memory limitations, you can process the data in batches, or use a free hosted platform like Google Colab.
temp = [
    ('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
    ('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]})
]

data = {}
for text, ann in temp:
    if text in data:
        # extend, not append: handles entries with more than one entity
        data[text]['entities'].extend(ann['entities'])
    else:
        # copy the list so later merges don't mutate the original dicts
        data[text] = {'entities': list(ann['entities'])}

temp = list(data.items())
print(temp)
[('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG'), (36, 42, 'ECOM-ORG')]})]
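The same O(n) grouping can also be written with collections.defaultdict, which removes the explicit membership check. A minimal sketch (the variable names grouped and merged are mine):

```python
from collections import defaultdict

temp = [
    ('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
    ('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]})
]

# Missing keys start as empty lists, so we can always extend
grouped = defaultdict(list)
for text, ann in temp:
    grouped[text].extend(ann['entities'])

# Rebuild spaCy's (text, {'entities': [...]}) tuple format
merged = [(text, {'entities': ents}) for text, ents in grouped.items()]
print(merged)
```

Since Python 3.7, dicts preserve insertion order, so the merged list keeps the sentences in their first-seen order.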