python regex text text-mining changelog

How to extract string between numbers? (And keep first number in the string?)


I am trying to extract data from a change log using a regex. Here is an example of how the change log is structured:

96545
this is some changes in the ticket
some new version: x.x.22
another change
new version: x.y.2.2
120091
this is some changes in the ticket
some new version: z.z.22
another change
another change
another change
new version: z.y.2.2
120092
...
...
...
  • Each data point starts with an ID of 5 to 6 digits.
  • Each ID is followed by a variable number of changes (lines) in the log.
  • Each data point ends with a line of the form new version: ***, where *** is a string that differs for every ID.

I was using the Regex Storm Tester to test my regex.

So far I have ^\w{5,6}(.|\n)*?\d{5,6}, but the result includes the ID of the next ticket, which I need to avoid.

Result:

96545
this is some changes in the ticket
some new version: x.x.22
another change
new version: x.y.2.2
120091 

Expected Result:

96545
this is some changes in the ticket
some new version: x.x.22
another change
new version: x.y.2.2

Solution

  • If the problem is that the match includes the ID of the next ticket, use a positive lookahead: it checks that the next ID follows but does not consume it, so the ID stays out of the match:

    import re

    # `text` is assumed to hold the whole change log as one string.
    # A ticket ends at the end of the line that is followed by the next ticket's ID.
    pattern = r"\d{5,6}[\s\S]*?(?=\n\d{5,6})"
    
    # to extract the first ticket, use search
    print(re.search(pattern, text).group(0))
    
    # to extract all tickets as a list, use findall
    print(re.findall(pattern, text))
    
    # if the file is too big, iterate over the tickets one at a time with finditer
    for ticket in re.finditer(pattern, text):
        print(ticket.group(0))
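
One caveat with the lookahead pattern: the last ticket in the log has no following ID, so the lookahead never succeeds there and that ticket goes unmatched. A possible variant, sketched below, also accepts the end of the text as a terminator and anchors the ID to a line of its own. The names sample_log, TICKET_RE and split_tickets are made up for this illustration, and the sketch assumes that no change line consists solely of 5 or 6 digits.

```python
import re

# Hypothetical sample mirroring the structure described in the question.
sample_log = """96545
this is some changes in the ticket
some new version: x.x.22
another change
new version: x.y.2.2
120091
this is some changes in the ticket
some new version: z.z.22
another change
new version: z.y.2.2
"""

# With re.MULTILINE, ^ and $ match at line boundaries, so the ID must fill
# a whole line. The lazy body stops before the next ID-only line, or before
# any trailing whitespace at the end of the text (\s*\Z).
TICKET_RE = re.compile(
    r"^\d{5,6}$[\s\S]*?(?=\n\d{5,6}$|\s*\Z)",
    re.MULTILINE,
)


def split_tickets(log_text):
    """Return each ticket (ID line plus its change lines) as one string."""
    return [match.group(0) for match in TICKET_RE.finditer(log_text)]


if __name__ == "__main__":
    for ticket in split_tickets(sample_log):
        # The first line is the ID; the remaining lines are the changes.
        ticket_id, *changes = ticket.splitlines()
        print(ticket_id, changes[-1])
```

Because each match keeps its ID as the first line, splitlines() is enough to separate the ID from the changes afterwards.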