python  python-2.7  text-parsing

parsing data tagged with ANSI color escape sequences


I need help converting a log file whose lines are tagged with ANSI color escape sequences and date-time stamps. Each line in the file has the following format:

'\x1b[34m[SOME_INFO]\x1b[0m \x1b[36m[SOME_OTHR_INFO]\x1b[0m Thu Sep 09 00:59:12 XST some variable length message which might contain commas (,), etc.'

I am on an isolated network with almost no Internet access, and I am using Python 2.7.

I have wasted a few hours :(. The closest I got was using @Elliot Chance's solution

re.sub(r'\x1b\[[\d;]+m', '', s)

provided in "Filtering out ANSI escape sequences", which I adapted as follows:

t = re.sub(r'\x1b\[[\d;]+m', '~', s)
re.split(r'~|(Mon|Tue|Wed|Thu|Fri|Sat|Sun.*?\d{4})', t)

which doesn't give me what I want. The output of the code above:

['',
 None,
 '[SOME_INFO]',
 None,
 ' ',
 None,
 '[SOME_OTHR_INFO]',
 None,
 ' ',
 'Thu',
 ' Sep 09 00:59:12 XST some variable length message which might contain commas (,), etc.']

The output I am looking for is as follows:

SOME_INFO, SOME_OTHR_INFO, Thu Sep 09 00:59:12 XST, some variable length message which might contain commas (,), etc.

Is there a way to load the data to a pandas dataframe using pandas.read_csv() or similar?

Note: Every line starts with an escape code, but the number of tagged fields can vary from line to line (i.e., SOME_INFO, SOME_OTHR_INFO, ANOTHER_INFO, etc., followed by the timestamp, followed by free text).


Solution

  • The following did the job for me:

    import re
    import pandas as pd
    
    def split_line(s):
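        # Replace each ANSI color sequence with '~' (assumes '~' is not present in the free text field).
        t = re.sub(r'\x1b\[[\d]+m', '~', s)
        # Collapse a marker followed by whitespace (and optionally another marker) into a single '~'.
        t = re.sub(r'~\s+~|~\s+', '~', t)
        # Split on '~' and after the timestamp; the pattern expects the timestamp to end in four digits.
        return filter(None, re.split(r'~|(\D{3}\s\D{3}\s\d{2}.*\d{4})\s+', t))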
    

    Next steps:

    • Read the file into a single column dataframe using
    df = pd.read_csv(file_name, header=None, sep='\n', engine='python', index_col=False)
    
    • Apply the function above to each row of the dataframe. I had trouble getting df.apply() to work, so I ended up using a list comprehension instead:
    col_names = ['A', 'B', 'C', 'D']
    df = pd.DataFrame([split_line(str(s)) for s in df[0]], columns=col_names)
    df.head()
    
    • Finally, write the dataframe to a CSV file using df.to_csv()
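
For lines with a varying number of tagged fields, the steps above could also be sketched without the '~' placeholder. This is only a rough alternative, not the method used above: the names parse_line and to_rows are made up for illustration, and the timestamp pattern assumes every stamp looks exactly like the sample in the question (weekday, month, day, HH:MM:SS, zone, with no year). It uses only the standard re module.

```python
import re

ANSI_RE = re.compile(r'\x1b\[[\d;]*m')    # ANSI color (SGR) sequences
TAG_RE = re.compile(r'\[([^\]]*)\]\s*')   # one leading [TAG] field
# Assumption: timestamps look like "Thu Sep 09 00:59:12 XST" (no year).
STAMP_RE = re.compile(
    r'([A-Z][a-z]{2} [A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2} \S+)\s*')


def parse_line(line):
    """Return (tags, timestamp, message) for one log line."""
    text = ANSI_RE.sub('', line).strip()
    tags = []
    pos = 0
    # Consume as many leading [TAG] fields as the line has.
    while True:
        m = TAG_RE.match(text, pos)
        if m is None:
            break
        tags.append(m.group(1))
        pos = m.end()
    m = STAMP_RE.match(text, pos)
    if m is None:
        # No recognizable timestamp: keep the rest as the message.
        return tags, '', text[pos:]
    return tags, m.group(1), text[m.end():]


def to_rows(lines):
    """Pad the tag fields so that every row has the same number of columns."""
    parsed = [parse_line(l) for l in lines]
    width = max([len(tags) for tags, _, _ in parsed] or [0])
    return [tags + [''] * (width - len(tags)) + [stamp, msg]
            for tags, stamp, msg in parsed]
```

The list returned by to_rows can be passed to pd.DataFrame(rows) and written out with df.to_csv(), which quotes any message that contains commas. Because the message is whatever follows the timestamp, a '~' or a four-digit number in the free text does not affect the split.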