python, keras, nlp

Why do you need a threshold when tokenizing a text corpus?


So I'm self-learning NLP and came across this Kaggle notebook that does text summarization using an LSTM. When it builds an OrderedDict mapping words to integers, there's some code that apparently calculates the percentage of rare words in the vocabulary:

thresh = 4

cnt, tot_cnt, freq, tot_freq = 0, 0, 0, 0

# x_tokenizer is a Keras Tokenizer already fitted on the corpus;
# word_counts maps each word to how many times it appeared.
for key, value in x_tokenizer.word_counts.items():
    tot_cnt += 1        # number of distinct words in the vocabulary
    tot_freq += value   # total number of word occurrences
    if value < thresh:  # word appears fewer than `thresh` times -> "rare"
        cnt += 1        # number of distinct rare words
        freq += value   # total occurrences of rare words

print("% of rare words in vocabulary:", (cnt / tot_cnt) * 100)
print("Total Coverage of rare words:", (freq / tot_freq) * 100)

Why is there a threshold value of 4 there? As far as I can see, the word-to-integer mappings are arbitrary (unless each integer equals the number of times the word was repeated), so a threshold of 4 seems arbitrary to me.

Thanks in advance for helping :)


Solution

  • The threshold gives you a chance to ignore "rare" words that wouldn't contribute that much to bag-of-words processing. Similarly, you might want to have an upper threshold so that you could ignore words like "the", "a", etc. that, because of their pervasiveness, also don't contribute much to distinguishing among sentence classes.
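To make that concrete, here is a minimal sketch of how such a threshold is typically wired into a Keras Tokenizer in notebooks like the one linked (the corpus, the lower threshold, and the variable names here are invented for illustration): count every word in a first pass, then refit with num_words chosen so that the rare tail is dropped when texts are converted to integer sequences.

from tensorflow.keras.preprocessing.text import Tokenizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog met here",
]
thresh = 2  # toy corpus is tiny, so a lower threshold than 4 is used here

# First pass: count every word in the corpus.
counter = Tokenizer()
counter.fit_on_texts(corpus)
rare = sum(1 for c in counter.word_counts.values() if c < thresh)
total = len(counter.word_counts)

# Second pass: keep only the frequent words. Keras retains the most
# frequent num_words - 1 words when converting texts to sequences,
# so add 1 to keep exactly the (total - rare) non-rare words.
x_tokenizer = Tokenizer(num_words=(total - rare) + 1)
x_tokenizer.fit_on_texts(corpus)
print(x_tokenizer.texts_to_sequences(corpus))  # rare words are silently dropped

So the "4" itself isn't special; it's just a knob the notebook author picked. You would tune it (and inspect the two percentages printed above) to trade vocabulary size against how much of the corpus you still cover.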