I'm analyzing a text corpus of about 135k documents (several pages each) with a vocabulary of about 800k words. I noticed that roughly half of the vocabulary consists of words with a frequency of 1 or 2, so I want to remove those words.
So I'm running something like this:
remove_indices = np.array(index_df[index_df['frequency'] <= 2]['index']).astype(int)

for file_name in tqdm(corpus):
    content = corpus[file_name].astype(int)
    content = [index for index in content if index not in remove_indices]
    corpus[file_name] = np.array(content).astype(np.uint32)
Here, corpus looks something like:
{
    'filename1.txt': np.array([43, 177718, 3817, ...., 28181]).astype(np.uint32),
    'filename2.txt': ....
}
and each word was previously encoded to a positive integer index.
The problem lies in the line

content = [index for index in content if index not in remove_indices]

which performs up to len(remove_indices) * len(content) membership checks per document, because each in test scans the whole remove_indices array. This would take forever (tqdm is estimating 100h+). Any tips on how to speed this up?
What I've tried so far
Removing each index from remove_indices after it has been removed from the corpus. Still taking forever...

You could use numpy.isin() (https://numpy.org/devdocs/reference/generated/numpy.isin.html) instead of this list comprehension.
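A minimal sketch of that approach, using small illustrative arrays in place of the question's real corpus and remove_indices:

```python
import numpy as np

# Illustrative stand-ins for the question's data structures
remove_indices = np.array([2, 5], dtype=np.uint32)
corpus = {
    'filename1.txt': np.array([1, 2, 3, 5, 8], dtype=np.uint32),
}

for file_name in corpus:
    content = corpus[file_name]
    # Boolean mask: True where the word index is NOT in remove_indices
    mask = np.isin(content, remove_indices, invert=True)
    corpus[file_name] = content[mask]
```

This replaces the Python-level loop over each array with a single vectorized call per document, and the result stays a uint32 array without a round trip through a list.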
Alternatively, you could create a set
of the indices to remove. Then the in
check is O(1) on average instead of O(n) (where n is the length of remove_indices).