Tags: python, list, dictionary, nlp

Putting in Pieces of Information in A Nested Dictionary (Python)


I'm trying to create a nested dictionary that tells me which document each word appears in and at which positions. For example:

dictionary ={}
textfile_list = ['file1.txt', 'file2.txt', 'file3.txt']
file_contents = ['mario luigi friend mushroom', 'rick mario morty portal summer mario', 'peter griffin shop'] 
# the first element corresponds to the contents of file1.txt, and so on

words = [['mario', 'luigi', 'friend', 'mushroom'],
        ['rick', 'mario', 'morty', 'portal', 'summer', 'mario'],
        ['peter', 'griffin', 'shop']] #tokenising the text

I'd want print(dictionary['mario']) to give [{'file1.txt': [0]}, {'file2.txt': [1, 5]}]

My code so far is:

dict = {}
for i in range(len(textfile_list)):
    check = file_contents
    for item in words:  #a list of every word from every file ['word1','wordn','word3',...]
  
        if item in check:
            if item not in dict:
                dict[item] = []
  
            if item in dict:
                dict[item].append(textfile_list[i])

dict = {k: list(set(v)) for k, v in dict.items()}

I don't know how to record the position of each word in a nested dictionary, which my code doesn't build at the moment. Could anyone help?


Solution

  • You have one layer of nesting too many. Your description corresponds to a dictionary whose keys are words and whose values are dictionaries mapping each filename to a list of positions (e.g. dictionary['mario'] = {'file1.txt': [0], 'file2.txt': [1, 5]}). The output you asked for is instead a dictionary whose values are lists of dictionaries, with one filename per dictionary.

    textfile_list = ['file1.txt', 'file2.txt', 'file3.txt']
    file_contents = ['mario luigi friend mushroom', 'rick mario morty portal summer mario',
                     'peter griffin shop']
    # the first element corresponds to the contents of file1.txt, and so on
    
    # the token lists below are equivalent to:
    # words = [text.split() for text in file_contents]
    
    words = [['mario', 'luigi', 'friend', 'mushroom'],
             ['rick', 'mario', 'morty', 'portal', 'summer', 'mario'],
             ['peter', 'griffin', 'shop']]  # tokenising the text
    
    dictionary = {}
    
    for textfile_name, file_strings in zip(textfile_list, words):
        for position, word in enumerate(file_strings):
            if word not in dictionary:
                dictionary[word] = {}
            if textfile_name not in dictionary[word]:
                dictionary[word][textfile_name] = []
    
            dictionary[word][textfile_name].append(position)
    
    print(dictionary['mario'])
    # prints: {'file1.txt': [0], 'file2.txt': [1, 5]}
    

    I'm not sure what the final line of your code (the list(set(v)) comprehension) is for, since there are currently no duplicates to remove. In any case, don't use dict as a variable name in Python: it shadows the built-in dict type.
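
As a possible variation on the loop above, the two "if ... not in ..." checks can be left to collections.defaultdict. This is only a sketch: the helper name build_index is made up for illustration, and the conversion at the end shows one way to get the list-of-dictionaries shape from the question, should that shape really be required.

```python
from collections import defaultdict


def build_index(filenames, tokenised_files):
    """Map each word to {filename: [positions]} (hypothetical helper name)."""
    # The outer defaultdict creates an inner defaultdict(list) for unseen words;
    # the inner one creates an empty list for unseen filenames.
    index = defaultdict(lambda: defaultdict(list))
    for filename, tokens in zip(filenames, tokenised_files):
        for position, word in enumerate(tokens):
            index[word][filename].append(position)
    # Convert back to plain dicts so that looking up a missing word raises
    # KeyError instead of silently inserting an empty entry.
    return {word: dict(files) for word, files in index.items()}


textfile_list = ['file1.txt', 'file2.txt', 'file3.txt']
file_contents = ['mario luigi friend mushroom',
                 'rick mario morty portal summer mario',
                 'peter griffin shop']
words = [text.split() for text in file_contents]  # tokenising the text

dictionary = build_index(textfile_list, words)
print(dictionary['mario'])  # {'file1.txt': [0], 'file2.txt': [1, 5]}

# If the list-of-dictionaries shape from the question is really needed:
as_list = [{name: positions} for name, positions in dictionary['mario'].items()]
print(as_list)  # [{'file1.txt': [0]}, {'file2.txt': [1, 5]}]
```

The nested-dictionary form is usually easier to query, because dictionary[word][filename] gives the positions directly, whereas the list form has to be scanned to find a particular file.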