I built this inverted index:
'experiment': {'d1': [1, [0]], ..., 'd30': [2, [12, 40]], ..., 'd123': [3, [11, 45, 67]], ...},
'studi': {'d1': [1, [1]], 'd2': [2, [0, 36]], ..., 'd207': [3, [19, 44, 59]], ...}
For example, the term experiment appears once in document 1 at index 0, twice in document 30 at indices 12 and 40, and so on. I am wondering how I could count the occurrences of each term in the index for a dictionary of queries that looks like this:
'q1' : ['similar', 'law', ..., 'speed', 'aircraft'],
'q2' : ['structur', 'aeroelast', ..., 'speed', 'aircraft'],
'q225': ['design', 'factor', ..., 'number', '5']
The desired output would look something like this:
'q1' : ['d51', 'd874', ..., 'd717'],
'q2' : ['d51', 'd1147', ..., 'd14'],
'q225': ['d1313', 'd996', ..., 'd193']
The keys are the queries and the values are the documents in which the query terms appear, sorted in descending order of total term frequency.
A document vector is a dict with items (document, word_count). These vectors can be added together by summing the word counts for matching document keys, with a default word_count of 0.
import functools

full_index = {
    'experiment': {'d1': [1, [0]], 'd30': [2, [12, 40]], 'd123': [3, [11, 45, 67]]},
    'study': {'d1': [1, [1]], 'd2': [2, [0, 36]], 'd207': [3, [19, 44, 59]]}
}

def count_only(docs):
    # Keep only the count from each [count, positions] entry.
    return {d: occurrences[0] for d, occurrences in docs.items()}

doc_vector_index = {w: count_only(docs) for w, docs in full_index.items()}

def doc_vector_add(ldoc, rdoc):
    # Sum the counts of two document vectors without modifying either one.
    res = ldoc.copy()
    for doc, count in rdoc.items():
        res[doc] = ldoc.get(doc, 0) + count
    return res

# queries is the dict of query terms shown in the question.
output = {}
for q, words in queries.items():
    # Look up the vector of each query word, skipping words that are not indexed.
    vectors = [doc_vector_index[word] for word in words if word in doc_vector_index]
    total_vector = dict(sorted(functools.reduce(doc_vector_add, vectors, {}).items(),
                               key=lambda item: item[1],
                               reverse=True))
    output[q] = list(total_vector.keys())
The summation of doc vectors is handled using functools.reduce(doc_vector_add, vectors, {}). This produces the doc vector that is the sum of the individual vectors for each word in the query. sorted is then used to order the items of that vector by count, largest first.
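As a worked example of the summation step, here is a minimal sketch that runs the same reduce-and-sort on the two sample terms from the index. The query name q_demo and its word list are made up for illustration, and the word 'missing' is included only to show that terms absent from the index are skipped.

```python
import functools

# Toy document vectors (term -> {document: count}) from the sample index.
doc_vector_index = {
    'experiment': {'d1': 1, 'd30': 2, 'd123': 3},
    'study': {'d1': 1, 'd2': 2, 'd207': 3},
}

def doc_vector_add(ldoc, rdoc):
    """Return a new vector with the counts of matching documents summed."""
    res = ldoc.copy()
    for doc, count in rdoc.items():
        res[doc] = res.get(doc, 0) + count
    return res

# Hypothetical query; 'missing' is not in the index and is ignored.
queries = {'q_demo': ['experiment', 'study', 'missing']}

output = {}
for q, words in queries.items():
    vectors = [doc_vector_index[w] for w in words if w in doc_vector_index]
    total = functools.reduce(doc_vector_add, vectors, {})
    ranked = sorted(total.items(), key=lambda item: item[1], reverse=True)
    output[q] = [doc for doc, _ in ranked]

print(total)   # {'d1': 2, 'd30': 2, 'd123': 3, 'd2': 2, 'd207': 3}
print(output)  # {'q_demo': ['d123', 'd207', 'd1', 'd30', 'd2']}
```

Document d1 contains both terms, so its counts are added (1 + 1 = 2). Documents with equal totals keep the order in which they were first seen, because Python's sort is stable even with reverse=True.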
max_doc_limit = 10
output[q] = list(total_vector.keys())[:max_doc_limit]
Limiting the documents can be handled by slicing before assigning to the output.
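A possible alternative for the limit, which is not what the code above does: collections.Counter from the standard library can both sum the counts and return the top entries through most_common(n). The sketch below assumes the same toy vectors; the word list and the limit of 2 are illustrative.

```python
from collections import Counter

# Toy document vectors (term -> {document: count}) from the sample index.
doc_vector_index = {
    'experiment': {'d1': 1, 'd30': 2, 'd123': 3},
    'study': {'d1': 1, 'd2': 2, 'd207': 3},
}

max_doc_limit = 2                 # illustrative limit
words = ['experiment', 'study']   # hypothetical query terms

total = Counter()
for w in words:
    # Counter.update() adds counts instead of replacing them.
    total.update(doc_vector_index.get(w, {}))

# most_common(n) returns the n (document, count) pairs with the highest counts.
top_docs = [doc for doc, _ in total.most_common(max_doc_limit)]
print(top_docs)  # ['d123', 'd207']
```

This avoids sorting the whole vector when only a few documents are needed, at the cost of moving away from the plain dict used in the rest of the answer.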
sorted(..., key=lambda item: (item[1], -1*int(item[0][1:])), ...)
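We can change the sorting order of the output by changing the key function passed to sorted. Here the key is a tuple of the count and the negated document number. Because the whole sort is reversed, multiplying the second element by -1 flips it from descending to ascending, so documents with equal counts are listed by increasing document number.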
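To make the tie-breaking concrete, here is a small sketch that sorts a made-up total vector with that key. It assumes every document id is the letter d followed by an integer, which is what int(item[0][1:]) relies on.

```python
# Made-up total vector, deliberately not in document order.
total = {'d30': 2, 'd1': 2, 'd207': 3, 'd123': 3, 'd2': 2}

ranked = sorted(
    total.items(),
    # Primary key: count. Secondary key: negated document number,
    # so that reverse=True yields ascending document numbers within ties.
    key=lambda item: (item[1], -int(item[0][1:])),
    reverse=True,
)
docs = [doc for doc, _ in ranked]
print(docs)  # ['d123', 'd207', 'd1', 'd2', 'd30']
```

Without the second tuple element, tied documents would simply keep their insertion order.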