python, regex, parsing, catalina

How to parse a multi-line Catalina log in Python with regex


I have a Catalina log:

oct 21, 2016 12:32:13 AM org.wso2.carbon.identity.sso.agent.saml.SSOAgentHttpSessionListener sessionCreated
WARNING: HTTP Session created without LoggedInSessionBean
oct 21, 2016 3:03:20 AM com.sun.jersey.spi.container.ContainerResponse logException
SEVERE: Mapped exception to response: 500 (Internal Server Error)
javax.ws.rs.WebApplicationException
    at ais.api.rest.rdss.Resource.lookAT(Resource.java:22)
    at sun.reflect.GeneratedMethodAccessor3019.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

I am trying to parse it in Python. My problem is that I don't know how many lines each log entry has; the minimum is two. I read from the file, and when a line starts with j, m, s, o, etc., I treat it as the first line of an entry, because these are the first letters of the month names. But I don't know how to continue. When should I stop reading lines? When the next line starts with one of these letters? And how do I do that?

import datetime
import re

import MySQLdb

SPACE = r'\s'
TIME = r'(?P<time>.*?M)'
PATH = r'(?P<path>.*?\S)'
METHOD = r'(?P<method>.*?\S)'
REQUEST = r'(?P<request>.*)'
TYPE = r'(?P<type>.*?\:)'

REGEX = TIME+SPACE+PATH+SPACE+METHOD+SPACE+TYPE+SPACE+REQUEST

def parser(log_line):
    match = re.search(REGEX, log_line)
    return (match.group('time'),
            match.group('path'),
            match.group('method'),
            match.group('type'),
            match.group('request'))

db = MySQLdb.connect(host="localhost", user="myuser", passwd="mypsswd", db="Database")
cursor = db.cursor()

# The log is only read, so open it in read mode ("rw" is not a valid mode).
with open("Mylog.log", "r") as f:
    for line in f:
        # First letters of the month names mark the first line of an entry.
        if line.startswith(('j', 'f', 'm', 'a', 's', 'o', 'n', 'd')):
            logLine = line
            result = parser(logLine)

            sql = ("INSERT INTO ..... ")
            data = (result[0],)  # query parameters must be a sequence
            cursor.execute(sql, data)

# The with block has already closed the file.
db.commit()
db.close()

The best idea I have is to read just two lines at a time, but that means discarding all the other data. There must be a better way.

I want to read the entries like this:

Line 1 - oct 21, 2016 12:32:13 AM org.wso2.carbon.identity.sso.agent.saml.SSOAgentHttpSessionListener sessionCreated WARNING: HTTP Session created without LoggedInSessionBean

Line 2 - oct 21, 2016 3:03:20 AM com.sun.jersey.spi.container.ContainerResponse logException SEVERE: Mapped exception to response: 500 (Internal Server Error) javax.ws.rs.WebApplicationException at ais.api.rest.rdss.Resource.lookAT(Resource.java:22) at sun.reflect.GeneratedMethodAccessor3019.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

Line 3 - oct 21, 2016 12:32:13 AM org.wso2.carbon.identity.sso.agent.saml.SSOAgentHttpSessionListener sessionCreated WARNING: HTTP Session created without LoggedInSessionBean

So I want to start reading when a line starts with a datetime (this is no problem). The problem is that I want to stop reading when the next line starts with a datetime.


Solution

  • This may be what you want.

    I read lines from the log inside a generator so that I can determine whether they are datetime lines or other lines. Also, importantly, I can flag that end-of-file has been reached in the log file.

    In the main loop of the program I accumulate lines in a list. Whenever I get a datetime line, I print the lines accumulated so far (if there are any) as a single line and start a new list with that datetime line. Since a complete entry will still be sitting in the list when end-of-file occurs, I arrange to print the accumulated line at that point too.

    import re
    
    # Tags for what the generator yields.
    a_date, other, EOF = 0, 1, 2
    
    def One_line():
        with open('caroline.txt') as caroline:
            for line in caroline:
                line = line.strip()
                # A datetime line looks like "oct 21, 2016 12:32:13 AM ...".
                m = re.match(r'[a-z]{3}\s+[0-9]{1,2},\s+[0-9]{4}\s+[0-9]{1,2}:[0-9]{2}:[0-9]{2}\s+[AP]M', line, re.I)
                if m:
                    yield a_date, line
                else:
                    yield other, line
        # Signal end-of-file so the caller can flush the last entry.
        yield EOF, ''
    
    complete_line = []
    for kind, content in One_line():
        if kind in [a_date, EOF]:
            # A new entry starts (or the file ended): print the previous one.
            if complete_line:
                print(' '.join(complete_line))
            complete_line = [content]
        else:
            complete_line.append(content)
    

    Output:

    oct 21, 2016 12:32:13 AM org.wso2.carbon.identity.sso.agent.saml.SSOAgentHttpSessionListener sessionCreated WARNING: HTTP Session created without LoggedInSessionBean
    oct 21, 2016 3:03:20 AM com.sun.jersey.spi.container.ContainerResponse logException SEVERE: Mapped exception to response: 500 (Internal Server Error) javax.ws.rs.WebApplicationException at ais.api.rest.rdss.Resource.lookAT(Resource.java:22) at sun.reflect.GeneratedMethodAccessor3019.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
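
The same grouping idea can be combined with the field extraction from the question. The following is only a sketch under some assumptions: `iter_records` and `parse_record` are hypothetical helper names, the field names are taken from the question's regex, and the `RECORD` pattern assumes every entry has the shape "timestamp, class, method, LEVEL: message" seen in the sample log. Entries with a different shape simply yield `None`.

```python
import re

# Assumed shape of the first line of an entry: "oct 21, 2016 12:32:13 AM ...".
RECORD_START = re.compile(
    r'[a-z]{3}\s+\d{1,2},\s+\d{4}\s+\d{1,2}:\d{2}:\d{2}\s+[AP]M', re.I)

# Assumed field layout of a joined entry, based on the sample log only:
# timestamp, logger class, method, level, then message plus any stack trace.
RECORD = re.compile(
    r'(?P<time>.*?[AP]M)\s+(?P<path>\S+)\s+(?P<method>\S+)\s+'
    r'(?P<type>[A-Z]+):\s+(?P<request>.*)', re.S)


def iter_records(lines):
    """Yield one joined string per log entry from an iterable of lines."""
    current = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # A datetime line closes the entry collected so far.
        if RECORD_START.match(line) and current:
            yield ' '.join(current)
            current = []
        current.append(line)
    # Flush the last entry at end of input.
    if current:
        yield ' '.join(current)


def parse_record(record):
    """Return the named fields of a joined entry as a dict, or None."""
    match = RECORD.match(record)
    return match.groupdict() if match else None
```

Because `iter_records` accepts any iterable of lines, it can be fed an open file directly, for example `for record in iter_records(f): fields = parse_record(record)`, and the resulting dict can then supply the values for the database insert.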