
How to convert a dictionary with a tuple inside of a nested list?


I'm trying to create bigrams from the values of a dictionary, subject to a specific condition. Here is an example of the dictionary:

dict_example = {'keywords1': ['africa',
  'basic service',
  'class',
  'develop country',
  'disadvantage',
  'economic resource',
  'social protection system']}

The condition is that I only want to create bigrams for elements that contain more than one word. Below is the code I have so far:

import nltk
from nltk.tokenize import word_tokenize

keywords_bigram_temp = {}
keywords_bigram = {}
for k, v in dict_example.items():
    keywords_bigram_temp.update({k: [word_tokenize(w) for w in v]})
    for k2, v2 in keywords_bigram_temp.items():
        keywords_bigram.update({k2: [list(nltk.bigrams(v3)) for v3 in v2 if len(v3) > 1]})

This code works, but instead of returning tuples in a flat list (which I think is what bigrams normally look like), it returns tuples inside a nested list. Below is an example of the result:

{'keywords1': [[('basic', 'service')],
  [('develop', 'country')],
  [('economic', 'resource')],
  [('social', 'protection'),
   ('social', 'system'),
   ('protection', 'system'),
   ('social', 'protection')]]}

What I need is the usual bigram structure, a flat list of tuples, like so:

{'keywords1': [('basic', 'service'),
  ('develop', 'country'),
  ('economic', 'resource'),
  ('social', 'protection'),
  ('protection', 'system')]}

Solution

  • Here's a way to do what your question asks using itertools.combinations():

    from itertools import combinations
    keywords_bigram = {'keywords1': [x for elem in dict_example['keywords1'] if ' ' in elem for x in combinations(elem.split(), 2)]}
    

    Output:

    {'keywords1': [('basic', 'service'), ('develop', 'country'), ('economic', 'resource'), ('social', 'protection'), ('social', 'system'), ('protection', 'system')]}
    

    Explanation:

    • in the list comprehension, for elem in dict_example['keywords1'] if ' ' in elem iterates over only those items in the list under keywords1 that contain a ' ' character, i.e. items made up of more than one word
    • the nested loop for x in combinations(elem.split(), 2) then yields every unique pair of words within each multi-word item, including pairs of words that are not adjacent
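
    To make the second point concrete, here is a small illustrative check (the variable names words and pairs exist only for this sketch) of what combinations() yields for a three-word phrase from the example dictionary:

```python
from itertools import combinations

# A three-word phrase taken from the example dictionary.
words = 'social protection system'.split()

# combinations() pairs each word with every later word, adjacent or not.
pairs = list(combinations(words, 2))
print(pairs)
# [('social', 'protection'), ('social', 'system'), ('protection', 'system')]
```

    The non-adjacent pair ('social', 'system') is included, which matters if only consecutive words should be paired.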

    UPDATE:

    OP has clarified that the original question contained an extra entry, and that what is actually required is pairs of consecutive words only: "in a 'a b c d' context, it will become ('a','b'),('b','c'),('c','d')". Here are three alternative solutions that do this.

    Solution #1 using the walrus operator := (requires Python 3.8+) in a list comprehension:

    keywords_bigram = {'keywords1': [x for elem in dict_example['keywords1'] if len(words := elem.split()) > 1 for x in zip(words, words[1:])]}
    

    Solution #2 using a long-hand for loop:

    keywords_bigram = {'keywords1': []}
    for elem in dict_example['keywords1']:
        words = elem.split()
        if len(words) > 1:
            keywords_bigram['keywords1'].extend(zip(words, words[1:]))
    

    Solution #3 without zip():

    keywords_bigram = {'keywords1': []}
    for elem in dict_example['keywords1']:
        words = elem.split()
        if len(words) > 1:
            for i in range(len(words) - 1):
                keywords_bigram['keywords1'].append(tuple(words[i:i+2]))
    

    All three solutions give identical output:

    {'keywords1': [('basic', 'service'), ('develop', 'country'), ('economic', 'resource'), ('social', 'protection'), ('protection', 'system')]}
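
    The solutions above hard-code the key 'keywords1'. If the dictionary may hold several keys, one possible generalization is sketched below. The helper names phrase_bigrams and bigrams_by_key are hypothetical, introduced only for this example, and it assumes that splitting on whitespace is adequate tokenization for the data:

```python
def phrase_bigrams(phrase):
    """Return the consecutive word pairs of a phrase as a list of tuples."""
    words = phrase.split()
    # zip() stops at the shorter input, so a one-word phrase yields no pairs
    # and no explicit len(words) > 1 check is needed.
    return list(zip(words, words[1:]))


def bigrams_by_key(keyword_dict):
    """Build a flat list of bigrams for every key in the dictionary."""
    return {
        key: [pair for phrase in phrases for pair in phrase_bigrams(phrase)]
        for key, phrases in keyword_dict.items()
    }


dict_example = {'keywords1': ['africa',
                              'basic service',
                              'class',
                              'develop country',
                              'disadvantage',
                              'economic resource',
                              'social protection system']}

keywords_bigram = bigrams_by_key(dict_example)
print(keywords_bigram)
```

    For the example dictionary this should produce the same result as the three solutions above, and any additional keys would be processed in the same way.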