javascript typescript algorithm text-processing

How to get the N most common words in a string in TypeScript?


Complete noob to JavaScript/TypeScript here. How can I get the n most common words in a sample text that contains punctuation, such as:

const sampleText = "hello world this is taco here is some foo bar text to say hello to my world of tacos in the world of text and it is very cool thanks stackoverflow for it's my birthday. This text also contains punctuation and my mom's car and periods and such. I like apples, pie, and apple pie. Case should be ignored so case and Case are the same. It's and its are two different words!"

I think punctuation can be filtered out of the resulting list after the fact, if that makes it easier.


Solution

  • Obviously you'll have to break that chunk of text up into words.

    Then you'll need to count the occurrences of each (unique) word.

    What is a "word"? Well, most straightforwardly, it's the characters between spaces.

    You mention that you want to ignore punctuation.

    Also, you probably want to ignore lettercase: "Hello" is the same word as "hello".


    Step by step:

    1. Convert the entire string to lowercase
    let lowerText = sampleText.toLowerCase()
    
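    2. Remove punctuation from the string

    This is easiest to do with a regular expression. This one replaces every character that is not a letter, digit, dash, or apostrophe with a space. Apostrophes are kept so that "it's" and "its" remain two different words, as the question requires. Because the string is already lowercase, the character class only needs a-z.

    let stringWithoutPunct = lowerText.replace(/[^a-z0-9'-]/g, ' ')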
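
A quick way to check what this kind of replacement does is to run it on a short string. The sketch below uses a made-up sentence, and the names `demo` and `cleanedDemo` are only for illustration. The character class in this sketch keeps apostrophes, on the assumption that "it's" and "its" should stay distinct.

```typescript
// Made-up sample sentence, not the text from the question
const demo = "It's my mom's car; its well-kept!"

// Lowercase first, then turn every character that is not a letter,
// digit, apostrophe, or dash into a space
const cleanedDemo = demo.toLowerCase().replace(/[^a-z0-9'-]/g, ' ')

// The ';' and '!' are now spaces; "it's", "mom's" and "well-kept" are intact
console.log(cleanedDemo)
```

Note that the semicolon followed by a space turns into two consecutive spaces, which matters when the string is split.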
    
    3. Split the text into individual words
    let rawWords = stringWithoutPunct.split(' ')
    

    Note that this produces some "words" that are the empty string wherever the string has two consecutive spaces (for example, where a comma followed by a space was replaced). We'll make sure to ignore those items in subsequent steps.
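
To see where the empty strings come from, here is a small sketch. The input is a hypothetical string that has already been lowercased and stripped of punctuation, and the names `cleaned`, `pieces`, and `nonEmpty` are made up for illustration. Filtering the empties out up front is an alternative to skipping them later, not something the steps here depend on.

```typescript
// Hypothetical input: roughly what "apples, pie, and apple pie." looks like
// after lowercasing and replacing punctuation with spaces
const cleaned = "apples  pie  and apple pie "

// Splitting on a single space yields '' between adjacent spaces
// and one more '' after the trailing space
const pieces = cleaned.split(' ')

// Optional: drop the empty strings right away
const nonEmpty = pieces.filter(piece => piece !== '')

console.log(pieces.length, nonEmpty.length)
```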

    4. Produce a list of unique words
    let uniqueWords: Array<string> = []
    for(let word of rawWords) {
      // if this word is the empty string, ignore it
      if(word === '') continue
      // if this word is already on the list, ignore it
      if(uniqueWords.includes(word)) continue
      // otherwise, add this word to the list
      uniqueWords.push(word)
    }
    
    5. Count the occurrences of each word

    We'll convert the list of unique words into a dictionary/hash whose keys are the words and whose values are the counts.

    let countedWords: Record<string, number> = {}
    for(let word of uniqueWords) {
      let count = 0
      // loop through the list of raw words, counting occurrences of this word
      for(let rawWord of rawWords) {
        if(rawWord === word) count += 1
      }
      
      // now store this word+count pair in the dictionary
      countedWords[word] = count
    }
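
As an aside, the unique-word list and the counting loop can be merged into a single pass, and the n most common words can then be read off by sorting the counts. The sketch below is one possible way to do that, not a continuation of the answer's own code: the function name `mostCommonWords` and its parameters are made up for illustration, it keeps apostrophes on the assumption that "it's" and "its" should stay distinct, and words with equal counts are left in first-seen order.

```typescript
// Hypothetical helper: returns the n most frequent words with their counts
function mostCommonWords(text: string, n: number): Array<[string, number]> {
  const counts = new Map<string, number>()

  const words = text
    .toLowerCase()
    // keep letters, digits, apostrophes and dashes; everything else becomes a space
    .replace(/[^a-z0-9'-]/g, ' ')
    .split(' ')

  for (const word of words) {
    // skip the empty strings left behind by consecutive spaces
    if (word === '') continue
    // a single pass: add 1 to the running count for this word
    counts.set(word, (counts.get(word) ?? 0) + 1)
  }

  // sort by count, highest first, and keep the first n entries
  return Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, n)
}

// logs the pairs ["the", 3] and ["and", 2]
console.log(mostCommonWords("the cat and the hat and the bat", 2))
```

A `Map` is used here instead of a plain object so that words such as "constructor" cannot collide with inherited object properties while incrementing.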