I'm teaching myself NLP and came across a Kaggle notebook that does text summarization using an LSTM. When it builds an OrderedDict
mapping words to integers, there's some code that apparently calculates the percentage of rare words in the vocabulary:
thresh = 4  # words occurring fewer than 4 times count as "rare"
cnt, tot_cnt, freq, tot_freq = 0, 0, 0, 0
for key, value in x_tokenizer.word_counts.items():
    tot_cnt += 1        # number of distinct words
    tot_freq += value   # total number of word occurrences
    if value < thresh:
        cnt += 1        # number of distinct rare words
        freq += value   # occurrences accounted for by rare words
print("% of rare words in vocabulary:", (cnt / tot_cnt) * 100)
print("Total Coverage of rare words:", (freq / tot_freq) * 100)
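To make the two printed numbers concrete, here is the same loop run on a toy dictionary standing in for x_tokenizer.word_counts (the words and counts are made up for illustration):

```python
# Toy stand-in for x_tokenizer.word_counts (word -> occurrence count)
word_counts = {"the": 10, "cat": 3, "sat": 2, "mat": 1, "on": 4}

thresh = 4
cnt, tot_cnt, freq, tot_freq = 0, 0, 0, 0
for key, value in word_counts.items():
    tot_cnt += 1
    tot_freq += value
    if value < thresh:
        cnt += 1
        freq += value

print((cnt / tot_cnt) * 100)    # 60.0 -- 3 of the 5 distinct words are "rare"
print((freq / tot_freq) * 100)  # 30.0 -- but they cover only 6 of the 20 total occurrences
```

So the first number is about vocabulary size, the second about how much of the corpus those rare words actually account for.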
Why is there a threshold value of 4 there? As far as I can see, the word-to-integer mappings are arbitrary (unless each integer is the number of times the word was repeated), so the threshold of 4 seems a bit arbitrary to me.
Thanks in advance for helping :)
First, note that x_tokenizer.word_counts is not the word-to-integer mapping: it maps each word to the number of times it occurred in the corpus, so value < thresh really is testing word frequency. The threshold gives you a chance to ignore "rare" words that wouldn't contribute much to bag-of-words processing; the specific value 4 is just a hyperparameter the notebook author picked, and you could tune it. Similarly, you might want an upper threshold so that you could ignore words like "the", "a", etc. that, because of their pervasiveness, also don't contribute much to distinguishing among sentence classes.
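As a sketch of that idea, here is a pure-Python filter that keeps only mid-frequency words using both a lower and an upper cutoff (the word list, counts, and threshold values are all hypothetical, chosen only for illustration):

```python
from collections import Counter

# Hypothetical word counts, standing in for a tokenizer's word_counts
word_counts = Counter({
    "the": 50, "a": 40,        # very common -- carry little signal
    "model": 12, "train": 9,   # mid-frequency -- usually the informative words
    "zeugma": 1, "ossify": 2,  # rare -- too few examples to learn from
})

low_thresh = 4    # drop words seen fewer than 4 times
high_thresh = 30  # drop words seen 30 times or more

kept = {w: c for w, c in word_counts.items()
        if low_thresh <= c < high_thresh}

print(sorted(kept))  # only the mid-frequency words survive
```

Both cutoffs are hyperparameters: lowering low_thresh keeps more of the long tail, and raising high_thresh keeps more of the very common words.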