I have to compute term weights for each word in a corpus, and one of my pre-processing steps is to remove every word that appears fewer than 5 times in the entire corpus. This is what I have done so far, and I'm sure it is not the most efficient approach.
Suppose I have 10 HTML documents. I read each document, tokenize it with nltk and BeautifulSoup, and write the output to a file. I do this for all 10 documents first, then read all 10 documents again to count how many times each term appears in the ENTIRE CORPUS, and write the output to different files.
Since I am reading and writing each file twice (and I have to do this for 1000 documents), the program takes very long to run. I would really appreciate any suggestion for an alternative method that is faster and more efficient. I am using Python 3.
Thank you
def remove_words(temp_path):
    # PREPROCESSING: remove words that occur only once in the entire corpus, i.e. words with value == 1
    temp_dict = {}
    with open(temp_path) as file:
        for line in file:
            (key, value) = line.split()
            temp_dict[key] = value
    #print("Length before removing words appearing just once: %s" % len(temp_dict))
    check_dir = temp_dict.copy()
    new_dir = full_dir.copy()
    for k, v in check_dir.items():  # Compare each temporary dictionary with items in full_dir. If a match exists and the value == 1, delete it
        for a, b in new_dir.items():
            if k == a and b == 1:
                del temp_dict[k]
    #print("Length after removing words appearing just once: %s \n" % len(temp_dict))
    return temp_dict
def calc_dnum(full_dir, temp_dict):
    # Function to calculate the total number of documents each word appears in
    dnum_list = {}
    for k, v in full_dir.items():
        for a, b in temp_dict.items():
            if k == a:
                dnum_list[a] = v
    return dnum_list
My guess is that your code is spending most of its time in this block:
for k, v in check_dir.items():
    for a, b in new_dir.items():
        if k == a and b == 1:
            del temp_dict[k]
and this block...
for k, v in full_dir.items():
    for a, b in temp_dict.items():
        if k == a:
            dnum_list[a] = v
You are doing a lot of unnecessary work here: you iterate over new_dir and temp_dict many times when once would be enough. Dictionary lookups are constant time, so the inner loop can be replaced by a single .get() or in test.
These two blocks can be simplified to:
for k in check_dir:
    if new_dir.get(k) == 1:
        del temp_dict[k]
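If you prefer building a new dictionary to deleting keys in place, the same filtering can be written as a dict comprehension — a minimal sketch, assuming new_dir maps each word to its corpus-wide count:

# Keep only the words whose corpus-wide count is not 1.
temp_dict = {word: count for word, count in temp_dict.items()
             if new_dir.get(word) != 1}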
and the second block:
for a in temp_dict:
    if a in full_dir:
        dnum_list[a] = full_dir[a]
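Stepping back, the bigger cost is reading and tokenizing every file twice. You can avoid the second pass entirely by keeping each document's tokens (or at least their per-document counts) in memory and building the corpus-wide counts with collections.Counter as you go. A rough sketch, assuming a hypothetical tokenize_html helper that wraps your existing BeautifulSoup + nltk step:

from collections import Counter

def tokenize_all(paths, tokenize_html):
    """Read and tokenize each document exactly once."""
    doc_tokens = {}            # path -> list of tokens
    corpus_counts = Counter()  # word -> count over the whole corpus
    for path in paths:
        with open(path, encoding="utf-8") as f:
            tokens = tokenize_html(f.read())  # your BeautifulSoup + nltk step
        doc_tokens[path] = tokens
        corpus_counts.update(tokens)
    return doc_tokens, corpus_counts

def drop_rare_words(doc_tokens, corpus_counts, min_count=5):
    """Remove words appearing fewer than min_count times in the whole corpus."""
    keep = {w for w, c in corpus_counts.items() if c >= min_count}
    return {path: [t for t in tokens if t in keep]
            for path, tokens in doc_tokens.items()}

With 1000 documents this touches each file only once; if holding all the token lists in memory is a concern, you can store a per-document Counter instead of the full token list.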