Tags: python, python-3.x, tuples, spacy, named-entity-recognition

Merge Tuples in a list - Spacy Trainset related


I have a list of n tuples (10K or more) in spaCy's training format, like the following:

[
('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]})
]

The plan is to group identical sentences and merge their entity dictionaries. I started with a brute-force loop, but that is very slow once the list reaches 10-25K items. Is there a better way to do this?

Desired output -

[('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG'), (36, 42, 'ECOM-ORG')]})]

Solution

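  • Use the fact that a str in Python is hashable, so it can serve as a dictionary key.

    Here I use a dictionary keyed on the sentence, i.e. the first element of each tuple, so all sentences are grouped in a single pass over the list.

    If memory is a constraint, you can process the data in batches or run it on a hosted notebook service such as Google Colab.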

    temp = [
    ('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
    ('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]})
    ]
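    data = {}
    for text, annotations in temp:
        # Sentences are hashable, so each dict lookup is O(1) on average.
        # setdefault creates a fresh entry, so the input dicts are not mutated.
        merged = data.setdefault(text, {'entities': []})
        # extend (not append) keeps every span when a tuple carries several.
        merged['entities'].extend(annotations['entities'])
    temp = list(data.items())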
    print(temp)
    
    [('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG'), (36, 42, 'ECOM-ORG')]})]
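
The same dictionary-grouping idea can be wrapped in a reusable function. The following is only a sketch under a few assumptions that are not part of the original answer: the helper name merge_annotations is hypothetical, exact duplicate spans are dropped, and the merged spans are sorted by start offset. Neither step is required for the merge itself, so remove them if the original order or duplicates matter.

```python
from collections import defaultdict


def merge_annotations(examples):
    """Group (text, {'entities': [...]}) tuples by text and merge their spans.

    Hypothetical helper: deduplication and sorting are illustrative choices.
    """
    spans_by_text = defaultdict(set)  # text -> set of (start, end, label)
    for text, annotations in examples:
        # tuple() guards against spans stored as lists, which are unhashable.
        spans_by_text[text].update(tuple(span) for span in annotations['entities'])
    # dicts keep insertion order (Python 3.7+), so sentences stay in first-seen order.
    return [(text, {'entities': sorted(spans)}) for text, spans in spans_by_text.items()]


examples = [
    ('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
    ('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]}),
    # Exact duplicate of the first annotation; it is dropped by the set.
    ('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
]
merged = merge_annotations(examples)
print(merged)
```

Because the function builds new containers instead of reusing the input dictionaries, the original list is left untouched, and the whole merge remains a single pass over the data.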