python, regex-group, python-re

Regex pattern for string - python


I would like to group part of a string in this format:

Some_text Some_text 1 2 3
BEGIN Some_text Some_text
44 76 1321
Some_text Some_text
END Some_text
Some_Text Some_text 1 4 5

I would like to group it from BEGIN to END, like this:

Some_text Some_text 1 2 3
<!-- START -->
BEGIN Some_text Some_text
44 76 1321
Some_text Some_text
END <!-- END --> Some_text

Some_Text Some_text 1 4 5

<!-- START --> and <!-- END --> are only comments marking where the group starts and ends. I want only the text between BEGIN and END.

I have the following, but it does not work in every case; with a lot of data it fails:

reg = re.compile(rf"{begin}[\-\s]+(.*)\n{end}", re.DOTALL)
core = re.search(reg, text).group(1)
lines = core.split("\n")

text is my string; after matching, I split the captured group into a list. I don't know how to apply this kind of matching directly to a list, so that I could work on a Python list of lines instead of on the string text.

Please give me some tips on how I can solve this.

Sample code:

import re
text="Some_text Some_text 1 2 3\nBEGIN Some_text Some_text\n44 76 1321\nSome_text Some_text\nEND Some_text\nSome_Text Some_text 1 4 5"

begin = "BEGIN"
end = "END"
reg = re.compile(rf"{begin}[\-\s]+(.*)\n{end}", re.DOTALL)
core = re.search(reg, text).group(1)
lines = core.split("\n")

print(lines)

It works here, but sometimes it fails and I don't know why, for example when the text is large (around 20k words). I want only the text between BEGIN and END.


Solution

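  • You might use the pattern below with re.MULTILINE, so that ^ matches at the start of every line. It matches BEGIN at the start of a line, then each following line that does not itself start with BEGIN or END, and stops at the next line that starts with END: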

    ^BEGIN\b(.*(?:\r?\n(?!(?:BEGIN|END)\b).*)*)\r?\nEND
    


    If you want the match to include BEGIN and END, you can omit the capturing group:

    ^BEGIN\b.*(?:\r?\n(?!(?:BEGIN|END)\b).*)*\r?\nEND
    


    Code example

    import re
    
    regex = r"^BEGIN\b(.*(?:\r?\n(?!(?:BEGIN|END)\b).*)*)\r?\nEND"
    
    test_str = ("Some_text Some_text 1 2 3\n"
        "BEGIN Some_text Some_text\n"
        "44 76 1321\n"
        "Some_text Some_text\n"
        "END Some_text\n"
        "Some_Text Some_text 1 4 5\n")
    
    print(re.findall(regex, test_str, re.MULTILINE))
    

    Output

    [' Some_text Some_text\n44 76 1321\nSome_text Some_text']
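
The question also asks about working on a list of lines instead of on one string. A minimal sketch of that idea follows, assuming the text has already been split into lines. The helper name `lines_between` is hypothetical (it is not part of `re`), and the choice to restart when a second BEGIN appears before an END is an assumption made to mirror the lookahead in the pattern above.

```python
import re


def lines_between(lines, begin="BEGIN", end="END"):
    """Return the lines of the first complete begin..end block, or None.

    Hypothetical helper: scans a list of lines instead of one big string.
    """
    begin_re = re.compile(rf"{re.escape(begin)}\b")
    end_re = re.compile(rf"{re.escape(end)}\b")
    collected = None  # stays None until a line starting with `begin` is seen

    for line in lines:
        m = begin_re.match(line)  # match() only matches at the start of the line
        if m:
            # Start (or restart) the block; keep the rest of the BEGIN line,
            # as capture group 1 of the regex above does.
            collected = [line[m.end():]]
        elif collected is not None:
            if end_re.match(line):
                return collected  # the END line itself is not included
            collected.append(line)

    return None  # no BEGIN, or BEGIN without a closing END


text = ("Some_text Some_text 1 2 3\nBEGIN Some_text Some_text\n44 76 1321\n"
        "Some_text Some_text\nEND Some_text\nSome_Text Some_text 1 4 5")
print(lines_between(text.split("\n")))
```

Returning None when no complete block exists lets the caller check for that case, whereas `re.search(reg, text).group(1)` raises AttributeError when nothing matches.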