javascript regex regex-lookarounds

Find keyword matches, but ignore some based on word proximity, in JavaScript regex


I'm trying to find matches for a word in a long string. However, I want to set up a proximity window around each match I keep, so that any further matches inside that window are ignored.

For example, take this string, in which I'm looking for test:

Lorem ipsum Test sit amet, consectetur adipiscing elit. 
Vestibulum at erat ac enim malesuada pulvinar et nec ante. 
Cras erat ipsum, pellentesque vel volutpat ut, Test eu test. 
Test Quisque tincidunt varius mi.

With a proximity of 15 words, my end result would show these highlighted:

Lorem ipsum **Test** sit amet, consectetur adipiscing elit. 
Vestibulum at erat ac enim malesuada pulvinar et nec ante. 
Cras erat ipsum, pellentesque vel volutpat ut, **Test** eu test. 
Test Quisque tincidunt varius mi.

So it only finds a Test that is either the first one or more than 15 words away from the previously highlighted one.


So far I have tried something similar to this:

\btest\W+(?:\w+\W+){15,}?test\b

But this highlights all the words in between, when I really only want to highlight test. It also requires me to write the keyword twice; I'd like to use the test keyword only once if possible.

Any ideas on how I could accomplish this sort of proximity behavior?


Clarification update:

I have an example on a regex tester at https://regex101.com/r/FDOWZU/1 where you can see that it selects all of the words between the instances of test. [Image: current output]
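To make the problem concrete, here is a minimal sketch that runs the attempted pattern against the sample text. The sample is joined into a single line here, which is an assumption about how the string is stored, and the variable names are illustrative only.

```javascript
// Sample text from the question, joined into one line (assumption).
const sample =
  "Lorem ipsum Test sit amet, consectetur adipiscing elit. " +
  "Vestibulum at erat ac enim malesuada pulvinar et nec ante. " +
  "Cras erat ipsum, pellentesque vel volutpat ut, Test eu test. " +
  "Test Quisque tincidunt varius mi.";

// The pattern from the attempt above, with the g and i flags added.
const attempt = /\btest\W+(?:\w+\W+){15,}?test\b/gi;

// Collect every match; each match object holds the full matched text at [0].
const attemptMatches = [...sample.matchAll(attempt)];

console.log(attemptMatches.length);
console.log(attemptMatches[0][0]);
```

The single match runs from the first Test through the second one, which is why everything in between gets highlighted.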

However, what I want is something more like this: [Image: expected output]


Solution

  • I'm not sure whether you mean >=15 or >15, since your code and your written logic contradict each other. In either case, you can replace 14 with one less than the number of words you want. The upper bound 14 ensures that test isn't one of the next 15 words, so the pattern matches test only if none of the next 15 words is test. Note that, within a run of nearby occurrences, this keeps the last one rather than the first.


    You can use the following regex:

    \btest(?!\W+(?:\w+\W+){0,14}test)
    

    const s = `Lorem ipsum Test sit amet, consectetur adipiscing elit. Vestibulum at erat ac enim malesuada pulvinar et nec ante. Cras erat ipsum, pellentesque vel volutpat ut, Test eu test. Test Quisque tincidunt varius mi. Suspendisse vitae lobortis diam. Vestibulum posuere massa id lectus faucibus posuere. Donec non sollicitudin est. Donec libero turpis, malesuada in Test`;
    // g: find every match, i: ignore case
    const r = /\btest(?!\W+(?:\w+\W+){0,14}test)/gi;
    let m;
    // exec() resumes from r.lastIndex on each call and returns null when no match is left
    while ((m = r.exec(s)) !== null) {
      console.log(m);
    }
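As a rough check of which occurrences the pattern keeps, the sketch below maps each match to its zero-based word position. The helper wordIndexAt is hypothetical and not part of the original answer; it simply counts runs of \w characters as words.

```javascript
const text = `Lorem ipsum Test sit amet, consectetur adipiscing elit. Vestibulum at erat ac enim malesuada pulvinar et nec ante. Cras erat ipsum, pellentesque vel volutpat ut, Test eu test. Test Quisque tincidunt varius mi. Suspendisse vitae lobortis diam. Vestibulum posuere massa id lectus faucibus posuere. Donec non sollicitudin est. Donec libero turpis, malesuada in Test`;
const pattern = /\btest(?!\W+(?:\w+\W+){0,14}test)/gi;

// Hypothetical helper: zero-based word position of a character offset,
// found by counting the words that come before that offset.
const wordIndexAt = (str, offset) =>
  (str.slice(0, offset).match(/\w+/g) || []).length;

// "test" occurs at word positions 2, 25, 27, 28 and 53 in this text.
const keptPositions = [...text.matchAll(pattern)].map((hit) =>
  wordIndexAt(text, hit.index)
);

console.log(keptPositions); // [ 2, 28, 53 ]
```

Positions 25 and 27 are dropped because another test follows within 15 words, so the occurrence that survives in that cluster is the last one (28), not the first.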

    How it works:

    • \b asserts a word boundary
    • test matches this literally (case-insensitively, because of the i flag)
    • (?!\W+(?:\w+\W+){0,14}test) is a negative lookahead ensuring that the following does not match:
      • \W+ matches one or more non-word characters
      • (?:\w+\W+){0,14} matches between zero and fourteen words, each followed by non-word characters
      • test matches this literally (case-insensitively again)
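If the goal is to keep the first occurrence of each cluster, as in the expected output of the question, one alternative is to walk the words in a loop instead of relying on a single pattern. This is a swapped-in technique, not the lookahead regex above. The names firstInProximity and gap are hypothetical, and "more than gap words away" is assumed to mean a difference in word positions greater than gap, measured from the last kept hit.

```javascript
// Hypothetical helper: keep a keyword hit only when it lies more than
// `gap` word positions after the previously kept hit.
function firstInProximity(text, keyword, gap) {
  const wanted = keyword.toLowerCase();
  const kept = [];
  let lastKept = -Infinity; // no hit kept yet, so the first hit always passes
  let wordIndex = 0;

  // Visit every word (run of \w characters) in order.
  for (const word of text.matchAll(/\w+/g)) {
    if (word[0].toLowerCase() === wanted && wordIndex - lastKept > gap) {
      kept.push({ index: word.index, wordIndex });
      lastKept = wordIndex;
    }
    wordIndex++;
  }
  return kept;
}

const input = `Lorem ipsum Test sit amet, consectetur adipiscing elit. Vestibulum at erat ac enim malesuada pulvinar et nec ante. Cras erat ipsum, pellentesque vel volutpat ut, Test eu test. Test Quisque tincidunt varius mi. Suspendisse vitae lobortis diam. Vestibulum posuere massa id lectus faucibus posuere. Donec non sollicitudin est. Donec libero turpis, malesuada in Test`;

const hits = firstInProximity(input, "test", 15);
console.log(hits.map((h) => h.wordIndex)); // [ 2, 25, 53 ]
```

Each returned entry also carries the character offset in index, which can be used to wrap the hit in highlighting markup.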