Tags: javascript, node.js, regex, regex-lookarounds, regex-group

Extract text containing match between new line characters


I am trying to use JS to extract paragraphs from OCR'd contracts when a paragraph contains key search terms. A user might search for something such as "ship ahead" to find clauses about whether a certain customer's orders can be shipped early.

I've been banging my head up against a regex wall for quite some time and am clearly just not grasping something.

If I have text like this and I'm searching for the word "match":

let text = "\n\nThis is an example of a paragraph that has the word I'm looking for The word is Match. \n\nThis paragraph does not have the word I want."

I would want to extract all the text between the double \n characters and not return the second paragraph in that string.

I've been trying some form of:

let string = `[^\n\n]*match[^.]*\n\n`;

let re = new RegExp(string, "gi");
let body = text.match(re);

However, that returns null. Oddly, if I remove the periods from the text, it works (sorta):

[
  "This is an example of a paragraph that has the word I'm looking for The word is Match \n" +
    '\n'
]

Any help would be awesome.


Solution

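  • Your pattern returns null because [^.]* cannot cross the period after "Match", so the trailing \n\n is never reached. Note also that [^\n\n] is the same character class as [^\n]; it does not mean "anything but a double newline".

    More generally, a single regex cannot cleanly extract the text between two identical delimiters only when that text contains a specific term, at least not without context-matching hacks such as lookarounds.

    It is simpler to split the text into paragraphs and keep those that contain your match: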

    const results = text.split(/\n{2,}/).filter(x=>/\bmatch\b/i.test(x))
    

    You may remove the word boundaries (\b) if you do not need a whole-word match.

    See the JavaScript demo:

    let text = "\n\nThis is an example of a paragraph that has the word I'm looking for The word is Match. \n\nThis paragraph does not have the word I want.";
    console.log(text.split(/\n{2,}/).filter(x=>/\bmatch\b/i.test(x)));
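Since the search term comes from a user (for example "ship ahead"), the split-and-filter idea above could be wrapped in a small function. This is only a sketch: escapeRegExp and findParagraphs are hypothetical names, not part of the original answer, and it assumes that OCR may wrap a phrase across a single line break, so any whitespace run between the words is accepted.

```javascript
// Hypothetical helper: escape regex metacharacters in a user-supplied term
// so that input such as "net 30 (days)" is matched literally.
function escapeRegExp(term) {
  return term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: return the paragraphs of `text` that contain `term`.
// Paragraphs are assumed to be separated by two or more newlines.
function findParagraphs(text, term, wholeWord = true) {
  // Assumption: words of a phrase may be separated by any whitespace,
  // including a single newline introduced by OCR line wrapping.
  const body = term.trim().split(/\s+/).map(escapeRegExp).join("\\s+");
  // \b only behaves as expected when the term starts and ends with a
  // word character; pass wholeWord = false for terms such as "C++".
  const pattern = wholeWord ? `\\b${body}\\b` : body;
  const re = new RegExp(pattern, "i"); // no "g" flag, so test() keeps no state
  return text
    .split(/\n{2,}/)
    .map(p => p.trim())
    .filter(p => p !== "" && re.test(p));
}

const contract =
  "\n\nOrders may ship ahead of the requested date. \n\n" +
  "Payment is due in 30 days.\n\n" +
  "Customer allows ship\nahead on request.";
console.log(findParagraphs(contract, "ship ahead"));
```

Trimming each paragraph drops the trailing space left over from the split, and the empty string produced by the leading \n\n is filtered out.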