Tags: python, regex, parsing, logfile-analysis

Best way to parse large files with regex in Python


I have to parse a large log file (2 GB) using regular expressions in Python. My regular expression matches the lines I am interested in; the log file also contains unwanted data, which should be skipped.

Here is a sample from the file:

"#DEBUG:: BFM [L4] 5.4401e+08ps MSG DIR:TX SCB_CB TYPE:DATA_REQ CPortID:'h8 SIZE:'d20 NumSeg:'h0001 Msg_Id:'h00000000"

My regular expression is ".DEBUG.*MSG."

First I split each matching line on whitespace, then I insert the "field:value" pairs into an sqlite3 database. For large files, parsing takes around 10 to 15 minutes.

Please suggest the fastest way to do this.


Solution

  • As others have said, profile your code to see why it is slow. The cProfile module, in conjunction with the gprof2dot tool, can produce nice, readable information.
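
    A minimal sketch of the profiling step, using only the standard library. The parse_lines function and the sample lines are hypothetical stand-ins for the real parsing loop, which the question does not show.

```python
import cProfile
import io
import pstats


def parse_lines(lines):
    """Hypothetical stand-in for the slow parsing loop."""
    count = 0
    for line in lines:
        if line.startswith("#DEBUG") and "MSG" in line:
            count += len(line.split())
    return count


def profile_call(func, *args):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(10)  # Show at most 10 entries
    return result, buffer.getvalue()


# Made-up input: 1000 lines of interest mixed with 1000 unwanted lines
sample = ["#DEBUG:: BFM MSG DIR:TX TYPE:DATA_REQ"] * 1000 + ["noise"] * 1000
result, report = profile_call(parse_lines, sample)
print(report)
```

    For a whole script, running python -m cProfile -o output.pstats your_script.py writes the same statistics to a file, which a tool such as gprof2dot can then turn into a call graph.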

    Without seeing your slow code, I can guess a few things that might help:

    First, you can probably get away with using the built-in string methods instead of a regex; this might be marginally quicker. If you do need a regex, it is worth precompiling it outside the main loop with re.compile.
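
    As a rough illustration of the two options, the sketch below filters the same lines once with a precompiled pattern and once with plain string methods. The pattern is an adaptation of the one in the question, and the sample lines are made up; whether the string version is actually faster on real data is something to measure (for example with timeit), not assume.

```python
import re

# Compiled once, outside any loop. The pattern is adapted from the
# question's ".DEBUG.*MSG." and is an assumption about the real log format.
DEBUG_MSG = re.compile(r"#DEBUG.*MSG")


def matches_regex(line):
    """True if the line starts with #DEBUG and later contains MSG."""
    return DEBUG_MSG.match(line) is not None


def matches_str(line):
    """Same check using only built-in string methods."""
    return line.startswith("#DEBUG") and "MSG" in line


lines = [
    "#DEBUG:: BFM [L4] 5.4401e+08ps MSG DIR:TX SCB_CB TYPE:DATA_REQ",
    "#INFO:: unrelated line",
    "#DEBUG:: no message marker here",
]
regex_hits = [matches_regex(line) for line in lines]
str_hits = [matches_str(line) for line in lines]
print(regex_hits, str_hits)
```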

    Second, do not run one INSERT query per line. Instead, do the insertions in batches: add the parsed info to a list and, when the list reaches a certain size, insert all of it with a single call to the executemany method.

    Some incomplete code, as an example of the above:

    import fileinput

    BATCH_SIZE = 10000  # Rows to collect before each database write

    parsed_info = []
    for line in fileinput.input():
        if not line.startswith("#DEBUG"):
            continue  # Skip lines we are not interested in

        # partition returns (before, separator, after); index 2 is everything after MSG
        msg = line.partition("MSG")[2]
        info = {}
        for w in msg.split():  # Split on whitespace
            k, sep, v = w.partition(":")  # Split each word on the first ":"
            if sep:  # Skip words that are not field:value pairs, e.g. SCB_CB
                info[k] = v

        parsed_info.append(info)

        if len(parsed_info) >= BATCH_SIZE:
            # Insert everything in parsed_info into the database
            ...
            parsed_info = []  # Clear the batch

    if parsed_info:
        # Insert the final, partial batch
        ...
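
    The database step left open above could look roughly like the following sketch. The table name msgs, its three columns, the tiny batch size, and the sample lines are all assumptions made for illustration; the question does not show its schema. The point is that each batch goes through one executemany call and one commit, rather than one INSERT and commit per line.

```python
import sqlite3

BATCH_SIZE = 2  # Tiny so this sketch flushes more than once; use thousands for real files

# Hypothetical schema: only three of the fields from the sample line
COLUMNS = ("DIR", "TYPE", "SIZE")


def parse_line(line):
    """Return the field:value pairs after MSG, or None for unwanted lines."""
    if not line.startswith("#DEBUG"):
        return None
    _, sep, rest = line.partition("MSG")
    if not sep:  # A #DEBUG line without MSG
        return None
    info = {}
    for word in rest.split():
        key, colon, value = word.partition(":")
        if colon:  # Keep only field:value words
            info[key] = value
    return info


def flush(conn, batch):
    """Write one batch with a single executemany call, then commit."""
    rows = [tuple(info.get(col) for col in COLUMNS) for info in batch]
    conn.executemany("INSERT INTO msgs (dir, type, size) VALUES (?, ?, ?)", rows)
    conn.commit()


def load(conn, lines):
    """Parse lines and insert them into the msgs table in batches."""
    conn.execute("CREATE TABLE IF NOT EXISTS msgs (dir TEXT, type TEXT, size TEXT)")
    batch = []
    for line in lines:
        info = parse_line(line)
        if info is None:
            continue
        batch.append(info)
        if len(batch) >= BATCH_SIZE:
            flush(conn, batch)
            batch = []
    if batch:  # Final, partial batch
        flush(conn, batch)


# Made-up lines modeled on the sample in the question
sample = [
    "#DEBUG:: BFM [L4] 5.4401e+08ps MSG DIR:TX SCB_CB TYPE:DATA_REQ SIZE:'d20",
    "unwanted data",
    "#DEBUG:: BFM [L4] 5.4402e+08ps MSG DIR:RX SCB_CB TYPE:DATA_RSP SIZE:'d8",
    "#DEBUG:: BFM [L4] 5.4403e+08ps MSG DIR:TX SCB_CB TYPE:DATA_REQ SIZE:'d4",
]
conn = sqlite3.connect(":memory:")
load(conn, sample)
rows = conn.execute("SELECT dir, type, size FROM msgs ORDER BY rowid").fetchall()
print(rows)
```

    For a real file, the list passed to load could be replaced by the open file object (or fileinput.input()), so that lines are read one at a time and the 2 GB file is never held in memory.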