Is there a way to get a regexp to match as much of a specific word as possible? For example, suppose I am looking for the following words: yesterday, today, tomorrow.
I want the following full words to be extracted:
The following whole words should fail to match (basically, spelling mistakes):
The best I could come up with so far is:
\b((tod(a(y)?)?)|(tom(o(r(r(o(w)?)?)?)?)?)|(yest(e(r(d(a(y)?)?)?)?)?))\b
(Example)
Note: I could implement this using a finite state machine but thought it would be a giggle to get regexp to do this. Unfortunately, anything I come up with is ridiculously complex and I'm hoping that I've just missed something.
The regex you are looking for should use nested optional groups combined with alternation.
\b(yest(?:e(?:r(?:d(?:ay?)?)?)?)?|tod(?:ay?)?|tom(?:o(?:r(?:r(?:ow?)?)?)?)?)\b
See demo
Note that the \b word boundaries are very important, since you want to match whole words only.
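To see what the boundaries buy you, here is a small sketch that runs the same pattern with and without \b on one misspelled word. The variable names (core, with_boundaries, without_boundaries) are only illustrative.

```python
import re

# The alternation from the answer, without the surrounding \b anchors
core = r'(yest(?:e(?:r(?:d(?:ay?)?)?)?)?|tod(?:ay?)?|tom(?:o(?:r(?:r(?:ow?)?)?)?)?)'

with_boundaries = re.compile(r'\b' + core + r'\b', re.IGNORECASE)
without_boundaries = re.compile(core, re.IGNORECASE)

misspelled = "tomorow"

# With \b the match must end at a word boundary, so the misspelling is rejected
print(with_boundaries.findall(misspelled))     # => []

# Without \b the regex happily returns the longest valid prefix it can find
print(without_boundaries.findall(misspelled))  # => ['tomor']
```

Without the anchors, the pattern silently accepts the leading part of a misspelled word, which is exactly what the question wants to avoid.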
Regex explanation:
\b - leading word boundary
(yest(?:e(?:r(?:d(?:ay?)?)?)?)?|tod(?:ay?)?|tom(?:o(?:r(?:r(?:ow?)?)?)?)?) - a capturing group matching one of:
    yest(?:e(?:r(?:d(?:ay?)?)?)?)? - yest, yeste, yester, yesterd, yesterda, or yesterday
    tod(?:ay?)? - tod, toda, or today
    tom(?:o(?:r(?:r(?:ow?)?)?)?)? - tom, tomo, tomor, tomorr, tomorro, or tomorrow
\b - trailing word boundary

import re
# Python 3: the ur'' prefix is Python 2 only, so a plain raw string is used
p = re.compile(r'\b(yest(?:e(?:r(?:d(?:ay?)?)?)?)?|tod(?:ay?)?|tom(?:o(?:r(?:r(?:ow?)?)?)?)?)\b', re.IGNORECASE)
# Valid prefixes first, then a blank line, then five misspellings that must not match
test_str = "yest\nyeste\nyester\nyesterd\nyesterda\nyesterday\ntod\ntoda\ntoday\ntom\ntomo\ntomor\ntomorr\ntomorro\ntomorrow\n\nyesteray\ntomorow\ntommorrow\ntody\nyesteday"
print(p.findall(test_str))
# => ['yest', 'yeste', 'yester', 'yesterd', 'yesterda', 'yesterday', 'tod', 'toda', 'today', 'tom', 'tomo', 'tomor', 'tomorr', 'tomorro', 'tomorrow']
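Since writing the nested optional groups by hand gets unwieldy, one option is to generate them. The following is a minimal sketch, not part of the original answer: prefix_pattern and build_regex are hypothetical helper names, and the minimum prefix lengths (4 for yesterday, 3 for today and tomorrow) are taken from the regex above.

```python
import re

def prefix_pattern(word, min_len):
    """Return a regex fragment matching any prefix of `word` that is
    at least `min_len` characters long (hypothetical helper)."""
    if not 1 <= min_len <= len(word):
        raise ValueError("min_len must be between 1 and len(word)")
    head, tail = word[:min_len], word[min_len:]
    nested = ""
    # Wrap from the last character outwards, so each character is
    # optional only if everything before it was matched
    for ch in reversed(tail):
        nested = "(?:" + re.escape(ch) + nested + ")?"
    return re.escape(head) + nested

def build_regex(words_with_min):
    """Combine the per-word fragments into one whole-word, case-insensitive regex."""
    alternatives = "|".join(prefix_pattern(w, n) for w, n in words_with_min)
    return re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)

p = build_regex([("yesterday", 4), ("today", 3), ("tomorrow", 3)])
print(p.findall("yest toda tomorro tomorow tody"))
# => ['yest', 'toda', 'tomorro']
```

The generated pattern writes the innermost character as (?:y)? rather than y?, which is slightly more verbose than the hand-written version but matches the same strings.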