python pandas-groupby nested-lists named-entity-recognition defaultdict

Python: Merge the string in the list (first element) based on the second element


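I have three sentences:

"Add changes & things to hot 50 playlist"
"add Madchild to Electro Latino"
"Add artist to my 80'S PARTY"

and a nested list holding the [word, slot] pairs tagged in each of them, one inner list per sentence: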

slot_list = [[['changes', 'entity_name'], ['&', 'entity_name'], ['things', 'entity_name'], ['hot', 'playlist'], ['50', 'playlist']], 
[['Madchild', 'artist'], ['Electro', 'playlist'], ['Latino', 'playlist']],
[['artist', 'music_item'], ['my', 'playlist_owner'], ["80'S", 'playlist'], ['PARTY', 'playlist']]]

I want to merge the strings at position [0] when their position [1] elements (the slots) are the same, while keeping the nested structure, since each inner list belongs to one sentence.

The expected output:

output = [[['entity_name', 'changes & things'], ['playlist', 'hot 50']],
 [['artist', 'Madchild'], ['playlist', 'Electro Latino']],
 [['music_item', 'artist'], ['playlist_owner', 'my'], ['playlist', "80'S PARTY"]]]

This is the code I used:

from collections import defaultdict

dic = defaultdict(str)
for element in slot_list:
    for word, slot in element:
        dic[slot] += ' ' + str(word)
print([[word, slot] for word, slot in dic.items()])

and I got:

[['entity_name', ' changes & things'], ['playlist', " hot 50 Electro Latino 80'S PARTY"], ['artist', ' Madchild'], ['music_item', ' artist'], ['playlist_owner', ' my']]

This merges the words with the same slot across all sentences, because the dict keeps a single entry per slot. I also tried groupby, but could not get it to work.

Hope someone can give me some guidance! Thanks!


Solution

  • Some terminology:

    • A pair is a list containing two string elements: the first one (value) is the value represented by the second one (key), so the ['changes', 'entity_name'] pair represents an entity name of value "changes", and the ['hot', 'playlist'] pair represents a playlist of value "hot".

    • A slot is a list of pairs.

    Assuming pairs with the same [1] element are adjacent within a slot (itertools.groupby only groups consecutive items) and a slot is

    [
        ['changes', 'entity_name'],
        ['&', 'entity_name'],
        ['things', 'entity_name'],
        ['hot', 'playlist'],
        ['50', 'playlist'],
    ]
    

    you can group the slot by each pair's second element:

    # itertools.groupby(slot, key=lambda x: x[1]), with each group shown as a list
    [
        ['entity_name', [
            ['changes', 'entity_name'],
            ['&', 'entity_name'],
            ['things', 'entity_name'],
        ]],
        ['playlist', [
            ['hot', 'playlist'],
            ['50', 'playlist'],
        ]],
    ]
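
To make the grouping step concrete, here is a minimal sketch that materializes each group as a list (groupby itself yields lazy iterators). The second half illustrates why adjacency matters: the `shuffled` slot and the `second` helper are made up for this illustration and do not come from the question.

```python
import itertools

slot = [
    ['changes', 'entity_name'],
    ['&', 'entity_name'],
    ['things', 'entity_name'],
    ['hot', 'playlist'],
    ['50', 'playlist'],
]

def second(pair):
    # Key function: the slot name is the second element of each pair
    return pair[1]

# groupby yields (key, lazy iterator) tuples, so copy each group into a list
grouped = [[key, list(pairs)] for key, pairs in itertools.groupby(slot, key=second)]
print([key for key, _ in grouped])  # ['entity_name', 'playlist']

# Hypothetical slot where equal keys are NOT adjacent
shuffled = [['hot', 'playlist'], ['Madchild', 'artist'], ['50', 'playlist']]
shuffled_keys = [key for key, _ in itertools.groupby(shuffled, key=second)]
print(shuffled_keys)  # ['playlist', 'artist', 'playlist']
```

In the second case 'playlist' shows up twice, because groupby starts a new group every time the key changes.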
    

    For each group of pairs, join the first elements with a space:

    import itertools
    
    def group_slots(slots):
        # For each slot in the list of slots, group it
        return [group_slot(slot) for slot in slots]
    
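    def group_slot(slot):
        # groupby yields (key, run of consecutive pairs sharing that key);
        # join the first element of each pair in the run with a space
        return [[key, ' '.join(pair[0] for pair in pairs)]
                for key, pairs in itertools.groupby(slot, key=lambda x: x[1])]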
    
    

    Then

    slots = [
        [
            ['changes', 'entity_name'],
            ['&', 'entity_name'],
            ['things', 'entity_name'],
            ['hot', 'playlist'],
            ['50', 'playlist'],
        ],
        [
            ['Madchild', 'artist'],
            ['Electro', 'playlist'],
            ['Latino', 'playlist'],
        ],
        [
            ['artist', 'music_item'],
            ['my', 'playlist_owner'],
            ["80'S", 'playlist'],
            ['PARTY', 'playlist'],
        ],
    ]
    result = group_slots(slots)
    print(result)
    

    outputs (reformatted here for readability)

    [
        [
            ['entity_name', 'changes & things'],
            ['playlist', 'hot 50'],
        ],
        [
            ['artist', 'Madchild'], 
            ['playlist', 'Electro Latino'],
        ],
        [
            ['music_item', 'artist'], 
            ['playlist_owner', 'my'], 
            ['playlist', "80'S PARTY"],
        ],
    ]
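
The defaultdict attempt from the question can also be made to work by creating a fresh dict for every sentence, so that words never leak from one sentence into another. The following is a sketch of that alternative, not the approach used above; the name `merge_by_slot` is made up here. Unlike groupby, it also merges equal slots that are not adjacent, and it relies on dicts preserving insertion order (Python 3.7+).

```python
from collections import defaultdict

def merge_by_slot(sentences):
    merged = []
    for sentence in sentences:
        # A fresh dict per sentence keeps the sentences separate
        words_by_slot = defaultdict(list)
        for word, slot in sentence:
            words_by_slot[slot].append(str(word))
        # Dicts keep insertion order, so slots appear in first-seen order
        merged.append([[slot, ' '.join(words)]
                       for slot, words in words_by_slot.items()])
    return merged

slot_list = [
    [['changes', 'entity_name'], ['&', 'entity_name'], ['things', 'entity_name'],
     ['hot', 'playlist'], ['50', 'playlist']],
    [['Madchild', 'artist'], ['Electro', 'playlist'], ['Latino', 'playlist']],
    [['artist', 'music_item'], ['my', 'playlist_owner'],
     ["80'S", 'playlist'], ['PARTY', 'playlist']],
]

result = merge_by_slot(slot_list)
print(result[0])  # [['entity_name', 'changes & things'], ['playlist', 'hot 50']]
```

For the data in the question both approaches should give the same result, since equal slots are always adjacent there.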