I have written code which takes a list of items and outputs a JSON object with the unique items as keys and their frequencies as values.
The code below works fine when I test it:
const tokenFrequency = tokens => {
  const setTokens = [...new Set(tokens)];
  return setTokens.reduce((obj, tok) => {
    const frequency = tokens.reduce((count, word) => word === tok ? count + 1 : count, 0);
    const containsDigit = /\d+/;
    if (!containsDigit.test(tok)) {
      obj[tok.toLocaleLowerCase()] = frequency;
    }
    return obj;
  }, new Object());
}
For example,
const x = ["hello", "hi", "hi", "whatsup", "hey"];
console.log(tokenFrequency(x));
produces the output
{ hello: 1, hi: 2, whatsup: 1, hey: 1 }
but when I try it with a huge data corpus's list of words, it seems to produce wrong results.
For example, if I feed it a list of 14,000+ words, such as the one at https://github.com/Nahdus/word2vecDataParsing/blob/master/corpous/listOfWords.txt, the frequency of the word "is" comes out as 4, but the actual frequency is 907.
Why does it behave like this for large data, and how can it be fixed?
You would need to normalize your tokens first by applying toLowerCase() to them, or find a way to differentiate between words that are the same but only differ in capitalization.
Reason:
Your small dataset has no "Is" words (with an uppercase 'I'). The large dataset does contain occurrences of "Is" (with an uppercase 'I'), which apparently has a frequency of 4, and that value in turn overwrites your lowercase "is"'s frequency.
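As a minimal sketch of that normalization (assuming you want fully case-insensitive counts and still want to skip tokens containing digits), you could lowercase each token before counting and build the frequencies in a single pass:

const tokenFrequency = tokens => {
  const containsDigit = /\d+/;
  return tokens.reduce((obj, tok) => {
    const word = tok.toLowerCase(); // normalize first so "Is" and "is" share one key
    if (!containsDigit.test(word)) {
      obj[word] = (obj[word] || 0) + 1; // increment instead of recomputing and overwriting
    }
    return obj;
  }, {});
};

console.log(tokenFrequency(["hello", "Hi", "hi", "whatsup", "hey"]));
// { hello: 1, hi: 2, whatsup: 1, hey: 1 }

This also avoids the nested reduce, so the token list is only walked once instead of once per unique token.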