
Convert text with no separators into a DataFrame in Python?


I have a lot of log files (.txt) that look somewhat like this:

2021-04-01T12:54:38.156Z START RequestId: 123 Version: $LATEST

2021-04-01T12:54:42.356Z END RequestId: 123

2021-04-01T12:54:42.356Z REPORT RequestId: 123  Duration: 4194.14 ms    Billed Duration: 4195 ms    Memory Size: 2048 MB    Max Memory Used: 608 MB 

I need to create a pandas DataFrame from this data with the following columns, where each row represents one log line:

DateTime, Keyword(start/end), RequestId, Duration, BilledDuration, MemorySize, MaxMemoryUsed

The problem is that the files have different lengths and contain different types of log lines, so not every line looks the same, but there are patterns. I've never used regular expressions, but I think that is what I need here. Is there a way to transform this text into a dataset?

(My goal is to perform memory usage anomaly detection.)


Solution

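  • So apparently I'm still bad at asking the right question on this website, but gladly a bit better at finding solutions by myself, so if somebody else has the same problem, this is what I did. I only keep the REPORT lines, since those carry the memory figures, and split them on whitespace: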

    import gzip
    import pandas as pd

    # file_list is assumed to hold the paths of the gzip-compressed log files
    rows = []

    for file in file_list:
        # open the compressed file as text and read it line by line
        with gzip.open(file, 'rt', encoding='utf-8') as f:
            for line in f:
                # look for the report lines
                if 'REPORT' not in line:
                    continue
                tokens = line.split()

                rows.append({
                    'timestamp': tokens[0],
                    'request_id': tokens[3],
                    'billed_duration': tokens[9],
                    'max_memory_used': tokens[18],
                    # "Init Duration" is not on every REPORT line
                    # (the sample line above has only 20 tokens)
                    'init_duration': tokens[22] if len(tokens) > 22 else None,
                })

    # if you want to pack it in a dataframe
    df = pd.DataFrame(rows)
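
Since the question asked about regular expressions, here is a rough sketch of a regex-based variant that keeps the START and END lines too and fills the columns listed in the question. It is not what I ran above. The names `COLUMNS`, `HEAD_RE`, `METRIC_RE` and `parse_line` are made up for illustration, and the patterns assume every line of interest starts with `<timestamp> <START|END|REPORT> RequestId: <id>` and that metrics look like `<Label>: <number> <ms|MB>`, as in the sample lines.

```python
import re

COLUMNS = ["DateTime", "Keyword", "RequestId", "Duration",
           "BilledDuration", "MemorySize", "MaxMemoryUsed"]

# Assumed prefix: "<timestamp> <START|END|REPORT> RequestId: <id>"
HEAD_RE = re.compile(r"^(\S+)\s+(START|END|REPORT)\s+RequestId:\s+(\S+)")
# Assumed metric layout: "<Label>: <number> <ms|MB>", e.g. "Memory Size: 2048 MB"
METRIC_RE = re.compile(r"([A-Z][A-Za-z ]*?):\s+(\d+(?:\.\d+)?)\s+(?:ms|MB)")


def parse_line(line):
    """Return a dict for a START/END/REPORT line, or None for any other line."""
    head = HEAD_RE.match(line)
    if head is None:
        return None
    row = dict(zip(COLUMNS[:3], head.groups()))
    # Metrics are looked up by label, so their position in the line does not matter.
    for label, value in METRIC_RE.findall(line):
        name = label.replace(" ", "")  # "Billed Duration" -> "BilledDuration"
        if name in COLUMNS:            # skip metrics that were not asked for
            row[name] = float(value)
    return row


def parse_lines(lines):
    """Parse an iterable of log lines, dropping the ones that do not match."""
    return [row for row in map(parse_line, lines) if row is not None]
```

Passing the result to `pd.DataFrame(parse_lines(lines), columns=COLUMNS)` should give one row per log line, with NaN in the metric columns for START and END lines. Matching by label instead of by token index means an extra field on a REPORT line does not shift the other values.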