Tags: python, regex, regex-group, regex-look-ahead

Parsing a web log file using regex in Python


I have a web log file that contains numeric host data and alpha-numeric username data. Here are a few lines from the log file:

189.254.43.43 - swift6867 [21/Jun/2019:15:53:00 -0700] "GET /architectures/recontextualize/morph/scale HTTP/1.0" 204 8976
20.80.28.12 - hagenes4423 [21/Jun/2019:15:53:01 -0700] "POST /harness HTTP/1.1" 404 28127
112.211.50.38 - - [21/Jun/2019:15:53:03 -0700] "DELETE /harness/e-business/functionalities HTTP/1.1" 405 7975

Sometimes, the username is replaced with a hyphen.

I want to extract only the data before the first square bracket and convert it into a list of dictionaries. For example:

example_dict = {"host":"189.254.43.43", 
                "user_name":"swift6867"}

This is the regex that I used:

pattern = """
    (?P<host>[\d]*[.][\d]*[.][\d]*[.][\d]*)     # host dictionary
    (?P<username>([\w]+|-)(?=\ \[))             # username dictionary 
"""

re.finditer(pattern,logdata,re.VERBOSE)

The regex doesn't find any matches. Only individual regex statements work. By this I mean the regex for host dictionary will work if I comment out the regex for username dictionary, and vice versa.

What am I doing wrong?


Solution

  • You can use the following regex:

    ^(?P<host>(?:\d+\.?){4})\s*-\s*(?P<user_name>[^\s-]*?)\s
    

    To create a list of dicts, call groupdict() on each Match object returned by finditer():

    import re
    ...
    pattern = r'^(?P<host>(?:\d+\.?){4})\s*-\s*(?P<user_name>[^\s-]*?)\s'
    result = [i.groupdict() for i in re.finditer(pattern, logdata, re.MULTILINE)]
    

    This regex takes slightly fewer steps, so it should be a little faster on larger inputs:

    ^(?P<host>\d+\.\d+\.\d+\.\d+)\s*-\s*(?P<user_name>[^\s-]*?)\s
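
As for why the original pattern finds nothing: with re.VERBOSE, unescaped whitespace inside the pattern is ignored, so the host and username groups end up directly adjacent, and nothing consumes the " - " that separates them in the log. The sketch below runs the second regex above on the three sample lines from the question, and also tries the question's own pattern with one assumed change, an explicit separator between the groups (plus a raw string, so the backslashes reach the regex engine untouched). The names `answer_result`, `fixed_pattern` and `fixed_result` are only illustrative.

```python
import re

# Sample lines copied from the question.
logdata = """\
189.254.43.43 - swift6867 [21/Jun/2019:15:53:00 -0700] "GET /architectures/recontextualize/morph/scale HTTP/1.0" 204 8976
20.80.28.12 - hagenes4423 [21/Jun/2019:15:53:01 -0700] "POST /harness HTTP/1.1" 404 28127
112.211.50.38 - - [21/Jun/2019:15:53:03 -0700] "DELETE /harness/e-business/functionalities HTTP/1.1" 405 7975
"""

# Second regex from the answer, unchanged.
answer_pattern = r'^(?P<host>\d+\.\d+\.\d+\.\d+)\s*-\s*(?P<user_name>[^\s-]*?)\s'
answer_result = [m.groupdict()
                 for m in re.finditer(answer_pattern, logdata, re.MULTILINE)]

# The question's pattern with one added line (an assumption about the
# smallest possible repair): a literal " - " between the two groups.
# Spaces inside a character class survive re.VERBOSE.
fixed_pattern = r"""
    (?P<host>[\d]*[.][\d]*[.][\d]*[.][\d]*)     # host
    [ ]-[ ]                                     # separator between host and user name
    (?P<username>([\w]+|-)(?=\ \[))             # user name, or a hyphen when absent
"""
fixed_result = [m.groupdict()
                for m in re.finditer(fixed_pattern, logdata, re.VERBOSE)]

print(answer_result)
print(fixed_result)
```

Note one difference between the two: for the line with a missing user, the answer's regex yields an empty string for user_name, while the repaired original keeps the hyphen.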