
JavaScript: find all occurrences of a word in a text document


I'm trying to write a JavaScript function that finds the indices of all occurrences of a word in a text document. This is what I have so far:

// Finds the indices of all occurrences of string 'needle' in string 'haystack' (case-insensitive)
function getMatches(haystack, needle) {
  if (needle && haystack) {
    var matches = [], ind = 0, l = needle.length;
    var t = haystack.toLowerCase();
    var n = needle.toLowerCase();
    while (true) {
      ind = t.indexOf(n, ind);
      if (ind == -1) break;
      matches.push(ind);
      ind += l; // continue searching after the current match
    }
  }
  return matches;
}

However, this is a problem because it also matches the word when it is part of a longer word. For example, if the needle is "book" and the haystack is "Tom wrote a book. The book's name is Facebook for dummies", the result contains the indices of 'book', 'book's' and 'Facebook', when I only want the index of the standalone 'book'. How can I accomplish this? Any help is appreciated.


Solution

  • Here's the regex I propose:

    /\bbook\b((?!\W(?=\w))|(?=\s))/gi
    

    To fix your problem, try it with the exec() method. The word boundaries (\b) reject longer words such as "Facebook" and "booklet", and the trailing lookahead group rejects "book's" from the example sentence you provided:

    // Note: the arguments are (needle, haystack), the reverse of the order in the question
    function getMatches(needle, haystack) {
        var myRe = new RegExp("\\b" + needle + "\\b((?!\\W(?=\\w))|(?=\\s))", "gi"),
            myArray, myResult = [];
        while ((myArray = myRe.exec(haystack)) !== null) {
            myResult.push(myArray.index);
        }
        return myResult;
    }
    

    Edit

    I've edited the regexp to account for words like "booklet" as well. I've also reformatted my answer to be similar to your function.

    You can do some testing here
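
One caveat when building the pattern from the needle: any regex metacharacters in the needle (for example ".") are interpreted as pattern syntax rather than literal text. Below is a minimal sketch of one way to handle that. The escapeRegExp helper and the getWordMatches name are hypothetical and not part of the answer above, and the sketch uses the question's (haystack, needle) argument order:

```javascript
// Hypothetical helper (not from the answer): escape regex metacharacters
// so the needle is matched as literal text.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Same pattern as the answer, with the needle escaped and empty input
// returning an empty array.
function getWordMatches(haystack, needle) {
  if (!haystack || !needle) return [];
  var re = new RegExp(
    "\\b" + escapeRegExp(needle) + "\\b((?!\\W(?=\\w))|(?=\\s))",
    "gi"
  );
  var match, result = [];
  while ((match = re.exec(haystack)) !== null) {
    result.push(match.index); // start index of each whole-word match
  }
  return result;
}

var text = "Tom wrote a book. The book's name is Facebook for dummies";
console.log(getWordMatches(text, "book")); // [12]
```

On the example sentence, the plain indexOf approach from the question finds indices 12, 22 and 41, while this version keeps only 12. Because the pattern still relies on \b, it only behaves as expected when the needle starts and ends with a word character.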