Tags: python, regex, regex-lookarounds, overlap, findall

Extract words surrounding a RegEx match using re.findall when there exists an overlapping index


The goal is to extract 100 characters before and after the keyword "bankruptcy".

import re

text = "The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s."

pattern = r"(?i)\s*(?:\w|\W){0,100}\b(?:bankruptcy)\b\s*(?:\w|\W){0,100}"

output = re.findall(pattern, text)

Expected output:

['The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.', 
 'The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.']

Current output: ['The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.']

Is there a way to resolve overlapping indexes using re.findall?


Solution

  • You can use the following solution based on the PyPI regex module (install it with pip install regex):

    import regex
    text = "The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s."
    pattern = r"\b(?<=(.{0,100}))(bankruptcy)\b(?=(.{0,100}))"
    print( [f"{x}{y}{z}" for x,y,z in regex.findall(pattern, text, flags=regex.I|regex.DOTALL)] )
    # => ['The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.', 'The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.']
    

    Regex details:

    • \b - a word boundary
    • (?<=(.{0,100})) - a positive lookbehind that matches a location immediately preceded by any 0 to 100 chars, which are captured into Group 1 (note that regex.DOTALL lets . match any char, including line breaks)
    • (bankruptcy) - Group 2: bankruptcy (matched in a case insensitive way due to regex.I flag)
    • \b - a word boundary
    • (?=(.{0,100})) - a positive lookahead that matches a location immediately followed by any 0 to 100 chars, which are captured into Group 3.

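    Since lookbehinds and lookaheads do not consume the text they match, only the keyword itself is consumed. The context around one occurrence may therefore overlap with the context around another, and you can access all the chars to the left and to the right of the word you search for.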

    Note that the built-in re module can't be used with this pattern because it does not allow variable-width patterns in lookbehinds.
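If installing regex is not an option, a different technique gives the same windows with the standard library alone: match only the keyword with re.finditer and slice the surrounding text by index. The following is a minimal sketch, not part of the original answer; the helper name context_windows and its width parameter are made up for illustration. It also shows re rejecting the variable-width lookbehind at compile time.

```python
import re

text = ("The company announced bankruptcy on jan 1, 1900. "
        "Many more companies announced bankruptcy in 1920s.")

# The built-in re module refuses a variable-width lookbehind when compiling.
try:
    re.compile(r"(?<=(.{0,100}))bankruptcy")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False


def context_windows(text, keyword, width=100):
    """Hypothetical helper: return each whole-word, case-insensitive match of
    `keyword` together with up to `width` chars on either side of it."""
    word = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    windows = []
    for m in word.finditer(text):
        # Clamp at 0 so a negative start index never wraps around the string.
        start = max(0, m.start() - width)
        # Slicing past the end of the string is safe in Python.
        windows.append(text[start:m.end() + width])
    return windows


windows = context_windows(text, "bankruptcy")
print(lookbehind_ok)
print(windows)
```

Because only the keyword is matched, the windows are free to overlap, just as with the lookaround-based pattern. The slices include line breaks, which mirrors the effect of regex.DOTALL in the solution above.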