python regex python-3.5 tarfile

Trying to restrict regex match scope


Python newbie here, so please excuse the basic question. I am trying to extract log data from a group of gzipped tar files. Each log entry spans multiple lines, so I extract each file from its compressed tar archive, read it as a single string, and run a regex over it, like this:

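import re
import tarfile

# re.DOTALL lets "." cross newlines, since each log entry spans several lines.
first_match = re.compile(r"(?P<date>\d{4}[-]?\d{1,2}[-]?\d{1,2} \d{1,2}:\d{1,2}:\d{1,2}).*?http://servername:99999/chargeit.*?manage_event=first.*?\bwantThisUser=([^&]*).*?\b_operator=(\w+).*?request\:.*?Want-To-Have-This\:\s\*123\*0\#", re.DOTALL)

linecount = 0
tfile = tarfile.open("logfile-year-month-day.number.log.tar.gz", "r")
for member in tfile.getmembers():
    extracted = tfile.extractfile(member)
    if extracted is None:  # directories and other non-file members
        continue
    # Decode the bytes (assuming UTF-8 logs); str() on bytes gives the "b'...'" repr.
    f = extracted.read().decode("utf-8", errors="replace")
    for match in first_match.finditer(f):
        linecount = linecount + 1
        print(linecount, match.group(1), match.group(2), match.group(3))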

I am trying to match the timestamp, and two other groups in the log file. Log data looks somewhat like this, if printed line by line:

2016-12-16 20:43:47 DEBUG[ispatcher-12570] this.is.the.api.Api - http://servername:99999/chargeit?session_id=1d7cb257e22946abbb3a14b17f232505&manage_event=first&wantThisUser=4119057000083&_source=operator3 b90e7798-8abd-4cf4-9660-45d6527e2804 request:
 HEADERS:
  this-is-a-header: 200
  Want-To-Have-This: *123*200#
  Host: servername:99999
  Accept: */*
  User-Agent: AHC/2.0
  Timeout-Access: <function1>
 CONTENT:

2016-12-16 20:43:47 DEBUG[ispatcher-12571] this.is.the.api.Api - http://servername:99999/chargeit?session_id=20111&manage_event=first&wantThisUser=4119185011005&_operator=operator4 926fa104-e72f-46e8-a5fc-912ef9707a01 request:
 HEADERS:
  this-is-a-header: 0
  Want-To-Have-This: *123*0#
  Host: servername:99999
  Accept: */*
  User-Agent: AHC/2.0
  Timeout-Access: <function1>
 CONTENT:

2016-12-16 20:26:29 DEBUG[ispatcher-12563] this.is.the.api.Api - http://servername:99999/chargeit?session_id=a5e456ad2f5645c39a580463630cd3db&manage_event=first&wantThisUser=4119023107960&_source=operator2 1021c087-1918-40a3-a7c1-4b7c37690471 request:
 HEADERS:
  this-is-a-header: 1000*0111111111
  Want-To-Have-This: *123*1000*0111111111#
  Host: servername:99999
  Accept: */*
  User-Agent: AHC/2.0
  Timeout-Access: <function1>
 CONTENT:

I am expecting to catch this:

2016-12-16 20:43:47 DEBUG[ispatcher-12571] this.is.the.api.Api - http://servername:99999/chargeit?session_id=20111&manage_event=first&wantThisUser=4119185011005&_operator=operator4 926fa104-e72f-46e8-a5fc-912ef9707a01 request:
 HEADERS:
  this-is-a-header: 0
  Want-To-Have-This: *123*0#

And the groups I'm hoping to capture are the timestamp (2016-12-16 20:43:47), the value of wantThisUser= (4119185011005), and the value of _operator= (operator4).

Instead, the regex captures the target entry together with the entry (or entries) above it:

2016-12-16 20:43:47 DEBUG[ispatcher-12570] this.is.the.api.Api - http://servername:99999/chargeit?session_id=1d7cb257e22946abbb3a14b17f232505&manage_event=first&wantThisUser=4119057000083&_source=operator3 b90e7798-8abd-4cf4-9660-45d6527e2804 request:
 HEADERS:
  this-is-a-header: 200
  Want-To-Have-This: *123*200#
  Host: servername:99999
  Accept: */*
  User-Agent: AHC/2.0
  Timeout-Access: <function1>
 CONTENT:

2016-12-16 20:43:47 DEBUG[ispatcher-12571] this.is.the.api.Api - http://servername:99999/chargeit?session_id=20111&manage_event=first&wantThisUser=4119185011005&_operator=operator4 926fa104-e72f-46e8-a5fc-912ef9707a01 request:
 HEADERS:
  this-is-a-header: 0
  Want-To-Have-This: *123*0#

It then pulls the timestamp and the other two groups from the entry (or entries) above the desired match. How do I restrict the match to a single log entry? Or am I approaching this the wrong way?


Solution

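  • Thanks, @blubberdibulb! You helped me narrow my block-matching regex down to first_match = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}.*?(?=^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}|\Z)", re.DOTALL|re.MULTILINE), which splits the log into one chunk per entry: each match runs from a timestamp at the start of a line up to the next such timestamp, or to the end of the text. Those chunks are much more manageable to parse, and everything's working much better now.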
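
For anyone landing here later, this is a minimal sketch of how the two-step approach might look: first split the text into entries with the block regex from the solution, then apply a field regex to each entry on its own so a match can never spill into a neighbouring entry. The names `entry_re`, `fields_re`, and `extract_entries` are hypothetical, the per-entry pattern is only adapted from the question (it is not necessarily what the poster ended up using), and the sample text is abridged from the log shown above.

```python
import re

TIMESTAMP = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"

# Block regex from the solution: one match per log entry, from a timestamp at
# the start of a line up to (not including) the next one, or the end of text.
entry_re = re.compile(
    r"^" + TIMESTAMP + r".*?(?=^" + TIMESTAMP + r"|\Z)",
    re.DOTALL | re.MULTILINE,
)

# Hypothetical per-entry pattern, adapted from the question's field names.
fields_re = re.compile(
    r"^(?P<date>" + TIMESTAMP + r").*?manage_event=first"
    r".*?\bwantThisUser=(?P<user>[^&\s]*)"
    r".*?\b_operator=(?P<operator>\w+).*?request:"
    r".*?Want-To-Have-This:\s\*123\*0#",
    re.DOTALL,
)


def extract_entries(text):
    """Return (date, user, operator) for every entry that fields_re matches."""
    results = []
    for entry in entry_re.finditer(text):
        # match() is anchored at the start of this one entry only.
        m = fields_re.match(entry.group(0))
        if m:
            results.append((m.group("date"), m.group("user"), m.group("operator")))
    return results


# Two entries abridged from the log in the question; only the second one has
# both _operator= and the "*123*0#" header.
sample = (
    "2016-12-16 20:43:47 DEBUG[ispatcher-12570] this.is.the.api.Api - "
    "http://servername:99999/chargeit?session_id=1d7cb257e22946abbb3a14b17f232505"
    "&manage_event=first&wantThisUser=4119057000083&_source=operator3 "
    "b90e7798-8abd-4cf4-9660-45d6527e2804 request:\n"
    " HEADERS:\n"
    "  this-is-a-header: 200\n"
    "  Want-To-Have-This: *123*200#\n"
    "\n"
    "2016-12-16 20:43:47 DEBUG[ispatcher-12571] this.is.the.api.Api - "
    "http://servername:99999/chargeit?session_id=20111"
    "&manage_event=first&wantThisUser=4119185011005&_operator=operator4 "
    "926fa104-e72f-46e8-a5fc-912ef9707a01 request:\n"
    " HEADERS:\n"
    "  this-is-a-header: 0\n"
    "  Want-To-Have-This: *123*0#\n"
)

print(extract_entries(sample))
# [('2016-12-16 20:43:47', '4119185011005', 'operator4')]
```

Because the field regex only ever sees one entry at a time, the lazy `.*?` parts cannot reach past the end of that entry, which is what caused the original pattern to start at an earlier timestamp.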