Extract substrings using regex in Python


Hello, I have this string and I need to extract some substrings from it according to some delimiters:

string = """
1538 a
123
skua456
789
5
g
15563 blu55g
b
456
16453 a
789
5
16524 blu
g
55
1734 a
987
987
55
aasf
552
18278 blu
ttry
"""

And I need to extract exactly these strings:

string1 = """
1538 a
123
skua456
789
5
g
15563 blu55g
"""
string2 = """
16453 a
789
5
16524 blu
"""
string3 = """
1734 a
987
987
55
aasf
552
18278 blu
"""

I have tried several functions: re.findall, re.search, and re.match, but I never got the expected result.

For example, the code below prints the whole string:

re.split(r"a(.*)blu", string)[0]

Solution

  • You do not need a regex for this. You can collect each run of lines that starts at a line containing a and ends at the next line containing blu:

    text = "1538 a\n123\nskua456\n789\n5\ng\n15563 blu55g\nb\n456\n16453 a\n789\n5\n16524 blu\ng\n55\n1734 a\n987\n987\n55\naasf\n552\n18278 blu\nttry"
    f = False
    result = []
    block = []
    for line in text.splitlines():
        if 'a' in line:
            f = True
        if f:
            block.append(line)
        if 'blu' in line and f:
            f = False
            result.append("\n".join(block))
            block = []
    
    print(result)
    # => ['1538 a\n123\nskua456\n789\n5\ng\n15563 blu55g', '16453 a\n789\n5\n16524 blu', '1734 a\n987\n987\n55\naasf\n552\n18278 blu']
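The loop above treats any line containing the letter a as a start marker, which is enough for the sample input. If the delimiters need to be stricter, the same idea could be wrapped in a small helper that takes the two tests as arguments. This is only a sketch: the name extract_blocks and the rule "a start line ends with ' a'" are assumptions made for illustration, not part of the original answer.

```python
def extract_blocks(text, is_start, is_end):
    """Collect runs of lines from a start line through the next end line."""
    blocks, current, inside = [], [], False
    for line in text.splitlines():
        if not inside and is_start(line):
            inside = True
        if inside:
            current.append(line)
            if is_end(line):
                # Close the block; an unterminated block is silently dropped.
                blocks.append("\n".join(current))
                current, inside = [], False
    return blocks


text = "1538 a\n123\nskua456\n789\n5\ng\n15563 blu55g\nb\n456\n16453 a\n789\n5\n16524 blu\ng\n55\n1734 a\n987\n987\n55\naasf\n552\n18278 blu\nttry"

blocks = extract_blocks(
    text,
    is_start=lambda line: line.endswith(" a"),  # assumption: start lines end in " a"
    is_end=lambda line: "blu" in line,          # same end test as the loop above
)
print(blocks)
# => ['1538 a\n123\nskua456\n789\n5\ng\n15563 blu55g', '16453 a\n789\n5\n16524 blu', '1734 a\n987\n987\n55\naasf\n552\n18278 blu']
```

Passing the tests as callables keeps the scanning logic unchanged when the delimiter rules change.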
    


    With regex, you can use

    import re

    print( re.findall(r'(?m)^.*a(?s:.*?)blu.*', text) )
    print( re.findall(r'(?m)^.*a(?:\n.*)*?\n.*blu.*', text) )
    


    The first regex means:

    • (?m)^ - multiline mode on, so ^ matches any line start position
    • .*a - any zero or more chars other than line break chars as many as possible, and then a
    • (?s:.*?) - any zero or more chars including line break chars as few as possible
    • blu.* - blu and then any zero or more chars other than line break chars as many as possible.
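To see why the scoped (?s:...) group matters, here is a small sketch on a made-up three-line sample (the sample string is an assumption, not the question's data). Without the group, the dot cannot cross a line break, so nothing spanning several lines can match. That also appears to be why the re.split attempt in the question returns the whole string: the pattern never matches, so nothing is split.

```python
import re

sample = "1 a\nx\n2 blu\n"  # hypothetical miniature input

# Plain '.' stops at line breaks, so 'a' and 'blu' must share a line.
without_scope = re.findall(r'(?m)^.*a.*?blu.*', sample)
print(without_scope)  # => []

# (?s:...) turns on DOTALL only inside the group, so '.*?' may span lines.
with_scope = re.findall(r'(?m)^.*a(?s:.*?)blu.*', sample)
print(with_scope)  # => ['1 a\nx\n2 blu']

# The pattern from the question: no match, so split returns the input unchanged.
attempt = re.split(r"a(.*)blu", sample)
print(attempt == [sample])  # => True
```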

    The second regex matches

    • (?m)^ - start of a line
    • .*a - any zero or more chars other than line break chars as many as possible, and then a
    • (?:\n.*)*? - zero or more lines, as few as possible
    • \n.*blu.* - a newline, any zero or more chars other than line break chars as many as possible, blu and any zero or more chars other than line break chars as many as possible.
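As a quick check, both patterns can be compiled and run against the sample text. The constant names below are chosen for illustration only. The second part of the sketch shows one difference between the two: the first pattern also accepts a and blu on the same line, while the second requires blu on a later line.

```python
import re

text = ("1538 a\n123\nskua456\n789\n5\ng\n15563 blu55g\nb\n456\n"
        "16453 a\n789\n5\n16524 blu\ng\n55\n"
        "1734 a\n987\n987\n55\naasf\n552\n18278 blu\nttry")

# Hypothetical names for the two patterns explained above.
DOTALL_GROUP = re.compile(r'(?m)^.*a(?s:.*?)blu.*')
LINE_BY_LINE = re.compile(r'(?m)^.*a(?:\n.*)*?\n.*blu.*')

blocks_1 = DOTALL_GROUP.findall(text)
blocks_2 = LINE_BY_LINE.findall(text)
print(blocks_1 == blocks_2)  # => True
print(len(blocks_1))         # => 3

# Both delimiters on one line: only the first pattern matches.
same_line = "7 a blu"
print(DOTALL_GROUP.findall(same_line))  # => ['7 a blu']
print(LINE_BY_LINE.findall(same_line))  # => []
```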