Search code examples
pythondictionarycountnested-lists

Count number of occurrences of dictionary values as nested list based on a query


I built this inverted index:

{
    'experiment': {'d1': [1, [0]], ..., 'd30': [2, [12, 40]], ..., 'd123': [3, [11, 45, 67]], ...}, 

    'studi': {'d1': [1, [1]], 'd2': [2, [0, 36]], ..., 'd207': [3, [19, 44, 59]], ...}

}

For example, the term experiment appears in document 1 one time at index zero, in document 30 two times at indices 12 and 40, etc. I am wondering how I could count the number of occurrences of each term in the dictionary based on a dictionary of queries that looks like this:

{
    'q1'  : ['similar', 'law', ..., 'speed', 'aircraft'],
    'q2'  : ['structur', 'aeroelast', ..., 'speed', 'aircraft'], 
    ...
    'q225': ['design', 'factor', ..., 'number', '5']
}

The desired output would look something like this:

{
    'q1'  : ['d51', 'd874', ..., 'd717'], 
    'q2'  : ['d51', 'd1147', ..., 'd14'],
    ...,
    'q225': ['d1313', 'd996', ..., 'd193']
}

With keys representing the query and values representing the documents that the query appeared in, and the list would be sorted in descending order of total term frequencies


Solution

  • Map queries to document vectors

    A document vector is a dict with items (document, word_count). These vectors can be added together by summing the word count for matching document keys with a default word_count of 0.

    CONVERT INDEX TO DOC VECTORS

    full_index = {
        'experiment': {'d1': [1, [0]],  'd30': [2, [12, 40]],  'd123': [3, [11, 45, 67]] } ,
        'study': {'d1': [1, [1]], 'd2': [2, [0, 36]],  'd207': [3, [19, 44, 59]]}
    }
    
    def count_only(docs):
        return {d: occurences[0] for d, occurences in docs.items()}
    
    doc_vector_index = {w: count_only(docs) for w, docs in full_index.items()}
    

    MAP LIST OF QUERY WORDS TO LIST OF DOC VECTORS

    for q, words in queries.items():
        vectors = [doc_vector_index[word] for word in words if word in doc_vector_index.keys()]
    

    SUM DOC VECTORS AND SORT

    def doc_vector_add(ldoc, rdoc):
        res = ldoc.copy()
        for doc, count in rdoc.items():
            res[doc] = ldoc.get(doc,0) + count
        return res
    
    for q, words in queries.items():
        vectors = [doc_vector_index[word] for word in words if word in doc_vector_index.keys()]
        total_vector = dict(sorted(functools.reduce(doc_vector_add, vectors, {}).items(), 
            key=lambda item: item[1], 
            reverse=True))
        output[q] = list(total_vector.keys())
    

    The summation of doc vectors is handled using reduce functools.reduce(doc_vector_add, vectors, {}). This produces the doc vector that is the sum of the individual vectors for each word in the query. sorted is used to sort the keys of the vector.

    LIMIT TO TOP N DOCUMENTS

    max_doc_limit = 10
    output[q] = list(total_vector.keys())[:max_doc_limit]
    

    Limiting the documents can be handled by slicing before assigning to the output.

    ORDER BY COUNT DESC, DOC_ID ASC

    sorted(...,key=lambda item: (item[1], -1*int(item[0][1:]),...)
    

    We can change the sorting order of the output by changing the key function passed to sorted. We use a trick of multiplying the second element in the tuple by -1 to reverse the order from descending to ascending.