I have an NLTK parsing function that I am using to parse a ~2GB text file of a TREC dataset. The goal is to tokenize the entire collection, perform some calculations (such as TF-IDF weights), and then run some queries against the collection, using cosine similarity to return the best results.
As it stands, my program works, but it is slow, typically taking between 44 and 61 minutes (the run timed below took even longer). The timing is broken down as follows:
TOTAL TIME TO COMPLETE: 4487.930628299713
TIME TO GRAB SORTED COSINE SIMS: 35.24157094955444
TIME TO CREATE TFIDF BY DOC: 57.06743311882019
TIME TO CREATE IDF LOOKUP: 0.5097501277923584
TIME TO CREATE INVERTED INDEX: 2.5217013359069824
TIME TO TOKENIZE: 4392.5711488723755
So obviously, the tokenization is accounting for ~98% of the time. I am looking for a way to speed that up.
The tokenization code is below:
import collections
import re
import string

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer


def remove_nums(arr):
    # Strip digit characters from every token
    pattern = '[0-9]'
    arr = [re.sub(pattern, '', i) for i in arr]
    return arr


def get_words(para):
    # Tokenize a paragraph: lowercase, strip punctuation and digits, drop stopwords
    stop_words = list(stopwords.words('english'))
    words = RegexpTokenizer(r'\w+')
    lower = [word.lower() for word in words.tokenize(para)]
    nopunctuation = [nopunc.translate(str.maketrans('', '', string.punctuation)) for nopunc in lower]
    no_integers = remove_nums(nopunctuation)
    dirty_tokens = [data for data in no_integers if data not in stop_words]
    tokens = [data for data in dirty_tokens if data.strip()]
    return tokens


def driver(file):
    # get_input (not shown here) returns the file contents as one string
    myfile = get_input(file)
    # Each <P ID=n>...</P> block is one paragraph/document
    p = r'<P ID=\d+>.*?</P>'
    paras = RegexpTokenizer(p)

    document_frequency = collections.Counter()
    collection_frequency = collections.Counter()
    all_lists = []
    currWordCount = 0
    currList = []
    currDocList = []
    all_doc_lists = []
    num_paragraphs = len(paras.tokenize(myfile))

    print()
    print(" NOW BEGINNING TOKENIZATION ")
    print()

    for para in paras.tokenize(myfile):
        group_para_id = re.match("<P ID=(\d+)>", para)
        para_id = group_para_id.group(1)
        tokens = get_words(para)
        tokens = list(set(tokens))
        collection_frequency.update(tokens)
        document_frequency.update(set(tokens))
        para = para.translate(str.maketrans('', '', string.punctuation))
        currPara = para.lower().split()
        for token in tokens:
            currWordCount = currPara.count(token)
            currList = [token, tuple([para_id, currWordCount])]
            all_lists.append(currList)
            currDocList = [para_id, tuple([token, currWordCount])]
            all_doc_lists.append(currDocList)

    # Build token -> [(para_id, count), ...] and para_id -> [(token, count), ...] maps
    d = {}
    termfreq_by_doc = {}
    for key, new_value in all_lists:
        values = d.setdefault(key, [])
        values.append(new_value)
    for key, new_value in all_doc_lists:
        values = termfreq_by_doc.setdefault(key, [])
        values.append(new_value)
I am pretty new to optimization, and am looking for some feedback. I did see this post which condemns a lot of my list comprehensions as "evil", but I can't think of a way around what I am doing.
The code is not well commented, so if for some reason it is not understandable, that is okay. I have seen other questions on this forum about speeding up NLTK tokenization that got little feedback, so I am hoping for a positive thread about good practices for optimizing tokenization.
By: https://codereview.stackexchange.com/users/25834/reinderien
On: https://codereview.stackexchange.com/questions/230393/tokenizing-sgml-text-for-nltk-analysis
If performance is a concern, this:
arr = [re.sub(pattern, '', i) for i in arr]
is a problem. You're re-compiling your regex on every function call - and every loop iteration! Instead, move the regex to a re.compile()'d symbol outside of the function.
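For instance, a minimal sketch of remove_nums with the pattern hoisted to module level (DIGIT_RE is just an illustrative name, not something from the original code):

import re

DIGIT_RE = re.compile('[0-9]')  # compiled once, at import time

def remove_nums(arr):
    # The compiled pattern's .sub() avoids re-resolving the pattern on every call
    return [DIGIT_RE.sub('', i) for i in arr]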
The same applies to re.match("<P ID=(\d+)>", para). In other words, you should be issuing something like
group_para_re = re.compile(r"<P ID=(\d+)>")
outside of the loop, and then
group_para_id = group_para_re.match(para)
inside the loop.
That same line has another problem - you're forcing the return value to be a list. Looking at your no_integers usage, you just iterate over it again, so there's no value to holding onto the entire result in memory. Instead, keep it as a generator - replace your brackets with parentheses.
The same thing applies to nopunctuation.
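As a rough sketch of those two changes (it reuses the compiled DIGIT_RE from the sketch above, and assumes the results really are only iterated once):

def remove_nums(arr):
    # Parentheses instead of brackets: a lazy generator, nothing is built up front
    return (DIGIT_RE.sub('', i) for i in arr)

# ...and inside get_words:
lower = (word.lower() for word in words.tokenize(para))
nopunctuation = (nopunc.translate(str.maketrans('', '', string.punctuation)) for nopunc in lower)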
stop_words should not be a list - it should be a set. Read about its performance here. Lookup is average O(1), instead of O(n) for a list.
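Concretely, that is a one-line change in get_words; the membership test (data not in stop_words) can stay exactly as written:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))  # set membership is O(1) on average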
nopunctuation should be no_punctuation.
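Pulling these points together, one possible shape for get_words is the untested sketch below. It assumes the NLTK stopwords corpus is downloaded, and the module-level names (DIGIT_RE, PUNCT_TABLE, STOP_WORDS, WORD_TOKENIZER) are illustrative, not from the original code; it also hoists the RegexpTokenizer construction out of the function, in the same spirit as the compiled regexes.

import re
import string

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

DIGIT_RE = re.compile('[0-9]')
PUNCT_TABLE = str.maketrans('', '', string.punctuation)
STOP_WORDS = set(stopwords.words('english'))
WORD_TOKENIZER = RegexpTokenizer(r'\w+')

def get_words(para):
    # Everything up to the final list is a lazy generator
    lower = (word.lower() for word in WORD_TOKENIZER.tokenize(para))
    no_punctuation = (word.translate(PUNCT_TABLE) for word in lower)
    no_integers = (DIGIT_RE.sub('', word) for word in no_punctuation)
    return [t for t in no_integers if t.strip() and t not in STOP_WORDS]

Whether this actually closes most of the 73-minute gap is something only profiling the real 2GB run can tell, but it removes the repeated compilation, the intermediate lists, and the O(n) stopword lookups that the points above identify.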