Complete noob to JavaScript/TypeScript here. How can I get the n most common words found in a sample text that contains punctuation, such as:
const sampleText = "hello world this is taco here is some foo bar text to say hello to my world of tacos in the world of text and it is very cool thanks stackoverflow for it's my birthday. This text also contains punctuation and my mom's car and periods and such. I like apples, pie, and apple pie. Case should be ignored so case and Case are the same. It's and its are two different words!"
I think punctuation can be filtered out of the resulting list after the fact, if that makes it easier.
Obviously you'll have to break that chunk of text up into words.
Then you'll need to count the occurrences of each (unique) word.
What is a "word"? Well, most straightforwardly, it's the characters between spaces.
You mention that you want to ignore punctuation.
Also, you probably want to ignore lettercase: "Hello" is the same word as "hello".
Step by step. First, convert the text to lowercase:
let lowerText = sampleText.toLowerCase()
Next, remove the punctuation. This is easiest to do with a regular expression. This one replaces every character that's not a letter, number, dash, or apostrophe with a space. Apostrophes are kept so that "it's" and "its" stay distinct words, as the question asks.
let stringWithoutPunct = lowerText.replace(/[^a-z0-9'-]/g, ' ')
Now split the string into words on spaces:
let rawWords = stringWithoutPunct.split(' ')
Note that this produces some "words" that are the empty string wherever the string has consecutive spaces, or a leading or trailing space. We'll make sure to ignore those items in subsequent steps.
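If you'd rather not carry empty strings around, here is a minimal sketch of the lowercase, strip, and split steps rolled into one function. The name `tokenize` is made up for illustration, and keeping apostrophes is an assumption based on the question's "It's and its are two different words" requirement. It splits on runs of whitespace and drops whatever empty strings remain.

```typescript
// Hypothetical helper (not part of the steps above): lowercase the text,
// replace punctuation with spaces, then split on runs of whitespace.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    // keep letters, digits, dashes and apostrophes; everything else becomes a space
    .replace(/[^a-z0-9'-]/g, ' ')
    // one or more whitespace characters count as a single separator
    .split(/\s+/)
    // leading or trailing whitespace can still leave an empty string at either end
    .filter((word) => word !== '')
}

const tokenizedExample = tokenize("I like apples, pie, and apple pie.")
console.log(tokenizedExample)
// [ 'i', 'like', 'apples', 'pie', 'and', 'apple', 'pie' ]
```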
Next, build a list of the unique words:
let uniqueWords: Array<string> = []
for (let word of rawWords) {
  // if this word is the empty string, ignore it
  if (word === '') continue
  // if this word is already on the list, ignore it
  if (uniqueWords.includes(word)) continue
  // otherwise, add this word to the list
  uniqueWords.push(word)
}
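As an aside, the same de-duplication can be sketched with a `Set`, which only ever stores one copy of each value. This is an alternative to the loop, not a required change, and the variable names here are invented for the example.

```typescript
// Illustrative input; in the walkthrough this would be rawWords.
const sampleRawWords: string[] = ['hello', 'world', '', 'hello', 'taco']

// Drop empty strings first, then let the Set discard duplicates.
// A Set iterates in insertion order, so first occurrences keep their position.
const uniqueSampleWords: string[] = Array.from(
  new Set(sampleRawWords.filter((word) => word !== ''))
)

console.log(uniqueSampleWords) // [ 'hello', 'world', 'taco' ]
```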
Now we'll convert the list of unique words into a dictionary (hash) whose keys are the words and whose values are their counts.
let countedWords: Record<string, number> = {}
for (let word of uniqueWords) {
  let count = 0
  // loop through the list of raw words, counting occurrences of this word
  for (let rawWord of rawWords) {
    if (rawWord === word) count += 1
  }
  // now store this word+count pair in the dictionary
  countedWords[word] = count
}
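What remains is picking the n most common words out of those counts. Here is a rough sketch of one way to do it: sort the word/count pairs by count, highest first, and keep the first n. The helper names `countWords` and `topWords` are hypothetical, the alphabetical tie-break is just one possible choice, and the counting is done in a single pass with a `Map` instead of the nested loops, purely as an alternative.

```typescript
// Hypothetical helper: count every non-empty word in a single pass.
function countWords(words: string[]): Map<string, number> {
  const counts = new Map<string, number>()
  for (const word of words) {
    if (word === '') continue
    counts.set(word, (counts.get(word) || 0) + 1)
  }
  return counts
}

// Hypothetical helper: return the n most frequent [word, count] pairs.
function topWords(counts: Map<string, number>, n: number): Array<[string, number]> {
  const pairs: Array<[string, number]> = Array.from(counts.entries())
  pairs.sort((a, b) => {
    // higher counts first
    if (b[1] !== a[1]) return b[1] - a[1]
    // assumption: ties are broken alphabetically so the result is stable
    return a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0
  })
  return pairs.slice(0, n)
}

const demoWords = 'the world of text and the world of tacos and the end'.split(' ')
console.log(topWords(countWords(demoWords), 3))
// [ [ 'the', 3 ], [ 'and', 2 ], [ 'of', 2 ] ]
```

With the variables from the steps above, `topWords(countWords(rawWords), n)` should give the n most common words together with their counts.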