Regular Expression to limit a string to the shortest match versus the longest match (non-greedy group)?


I'm searching within paragraphs of text.

I'd like to find strings within those paragraphs that start with a specific word, and then grab the text that immediately follows that word. I'd like to stop at the first period, exclamation mark, question mark, or newline. If none of these is found within 100 characters of the search word, I'd like to cut the string off at the word boundary closest to the 100-character limit.

How can I do this?

EXAMPLE

string: "A test sentence containing an ngram and ending with a period. Another sentence that does not have the word we're searching for and runs on until we're past 100 characters."

regex: /\bngram(.{0,100})(\.|\b)/i

desired output: ' and ending with a period'

In this case, my regex returns " and ending with a period. Another sentence that does not have the word we're searching for and runs." It goes on longer than I wanted, maybe because the period/word-boundary capture group is greedy? I don't know how to limit it to the shortest match instead of the longest match.


Solution

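  • Use a negated character class that excludes the dot. In the original pattern, `.` also matches the period itself, so the greedy `.{0,100}` runs straight past it; `[^.]` cannot. To stop at `!`, `?`, or a newline as well, add them to the class, e.g. `[^.!?\n]`.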

    /\bngram([^.]{0,100})(\b|\.)/i
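
As a rough illustration, here is a minimal Python sketch of the negated-class approach. The function name `text_after_ngram` and the constant names are hypothetical, and widening the class to `[^.!?\n]` is an assumption based on the stop characters listed in the question rather than part of the posted answer.

```python
import re

# Original pattern from the question: "." also matches the period,
# so the greedy quantifier keeps going for up to 100 characters.
GREEDY = re.compile(r"\bngram(.{0,100})(\.|\b)", re.IGNORECASE)

# Negated class (assumed extension: also stop at "!", "?" and newline).
# The capture cannot cross a terminator; if none appears within 100
# characters, backtracking stops at the nearest earlier word boundary.
BOUNDED = re.compile(r"\bngram([^.!?\n]{0,100})(\b|[.!?\n])", re.IGNORECASE)


def text_after_ngram(text, pattern=BOUNDED):
    """Return the text captured after the keyword, or None if it is absent."""
    match = pattern.search(text)
    return match.group(1) if match else None


sample = (
    "A test sentence containing an ngram and ending with a period. "
    "Another sentence that does not have the word we're searching for "
    "and runs on until we're past 100 characters."
)

print(repr(text_after_ngram(sample)))          # ' and ending with a period'
print(repr(text_after_ngram(sample, GREEDY)))  # runs past the first period
```

Note that the capture keeps the leading space after the keyword, so a caller may want to call `.strip()` on the result.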