python full-text-search information-retrieval

Efficient way to find specific words in the entire corpus

I have to construct a document with term weights for each word in the corpus and I have a couple of pre-processing steps to do. One of them is to remove every word appearing less than 5 times in the entire corpus. This is what I have done and I'm sure it is not the most efficient method.

Suppose I have 10 HTML documents. I read each document, tokenize it using nltk and BeautifulSoup, and write the output to a file. I have to do this for all 10 documents first. Then I read all 10 documents again to check how many times a particular term appears in the ENTIRE CORPUS and write the output to different files.

Since I am reading and writing each file twice (and I have to do this for 1000 documents), the program takes very long to run. I would really appreciate it if anyone could suggest an alternative method that is more efficient. I am using Python 3.

Thank you

def remove_words(temp_path):
##### PREPROCESSING: Remove words that occur only once in the entire corpus, i.e. words with value = 1
        temp_dict={}
        with open(temp_path) as file:
                for line in file:
                        (key,value)=line.split()
                        temp_dict[key]=value
        #print("Length before removing words appearing just once: %s"%len(temp_dict))
        check_dir=temp_dict.copy()
        new_dir=full_dir.copy()
        for k,v in check_dir.items(): #Compare each temporary dictionary with items in full_dir. If a match exists and the key's value is 1, delete it
                for a,b in new_dir.items():
                        if k==a and b==1:
                                del temp_dict[k]


        #print("Length after removing words appearing just once: %s \n"%len(temp_dict))
        return temp_dict


def calc_dnum(full_dir,temp_dict):
#Function to calculate the total number of documents each word appears in       
        dnum_list={}
        for k,v in full_dir.items():
                for a,b in temp_dict.items():
                        if k==a:
                                dnum_list[a]=v

        return dnum_list

Solution

My guess is that your code is spending most of its time in this block:

for k,v in check_dir.items():
    for a,b in new_dir.items():
        if k==a and b==1:
            del temp_dict[k]

and this block...

for k,v in full_dir.items():
    for a,b in temp_dict.items():
        if k == a:
            dnum_list[a] = v

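You are doing a lot of unnecessary work here. Each block walks the inner dictionary once for every key in the outer one, so the work grows with the product of the two dictionary sizes. Because a dictionary can look up a key directly, a single pass is enough.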

These two blocks can be simplified to:

for k in check_dir:
    if new_dir.get(k) == 1:
        del temp_dict[k]

and:

for a in temp_dict:
    if a in full_dir:
        dnum_list[a] = full_dir[a]
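
Putting the two pieces together, here is a minimal sketch of how both functions might look with direct dictionary lookups. It rests on a few assumptions that are not in your post: full_dir maps each word to an integer count, it is passed in as an argument instead of being read as a global, and remove_words receives an already-loaded dictionary rather than a file path. The min_count parameter is a hypothetical addition so the same function could cover the "fewer than 5 times" rule.

```python
def remove_words(temp_dict, full_dir, min_count=2):
    """Return a copy of temp_dict without rare words.

    Assumption: full_dir maps word -> integer count for the whole corpus.
    A word is dropped when its count in full_dir is below min_count.
    Words missing from full_dir are kept, as in the original loops.
    """
    return {
        word: value
        for word, value in temp_dict.items()
        # Defaulting to min_count keeps words that full_dir does not know.
        if full_dir.get(word, min_count) >= min_count
    }


def calc_dnum(full_dir, temp_dict):
    """Map each word of temp_dict to its value in full_dir, if present."""
    return {word: full_dir[word] for word in temp_dict if word in full_dir}


# Small made-up example data, only to show the call pattern.
full_dir = {"apple": 1, "banana": 3, "cherry": 7}
temp_dict = {"apple": "1", "banana": "2", "durian": "4"}

kept = remove_words(temp_dict, full_dir)
dnum = calc_dnum(full_dir, kept)
print(kept)
print(dnum)
```

Each function now does one pass over temp_dict with a constant-time lookup per word, instead of scanning full_dir again for every word.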