
How to read a log file with multiple delimiters into a pandas DataFrame?


My log file format is as below:

2016-09-28 04:30:30, Info                  CBS    Loaded Servicing Stack v6.1.7601.23505 with Core: C:\Windows\winsxs\amd64_microsoft-windows-servicingstack_31bf3856ad364e35_6.1.7601.23505_none_681aa442f6fed7f0\cbscore.dll
2016-09-28 04:30:31, Info                  CSI    00000001@2016/9/27:20:30:31.455 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fef9fb9b6d @0x7fef9f8358f @0xff83e97c @0xff83d799 @0xff83db2f)
2016-09-28 04:30:31, Info                  CSI    00000002@2016/9/27:20:30:31.458 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fefa006ade @0x7fef9fd2984 @0x7fef9f83665 @0xff83e97c @0xff83d799)
2016-09-28 04:30:31, Info                  CSI    00000003@2016/9/27:20:30:31.458 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fefa1c8728 @0x7fefa1c8856 @0xff83e474 @0xff83d7de @0xff83db2f)
2016-09-28 04:30:31, Info                  CBS    Ending TrustedInstaller initialization.
2016-09-28 04:30:31, Info                  CBS    Starting the TrustedInstaller main loop.
2016-09-28 04:30:31, Info                  CBS    TrustedInstaller service starts successfully.
2016-09-28 04:30:31, Info                  CBS    SQM: Initializing online with Windows opt-in: False
2016-09-28 04:30:31, Info                  CBS    SQM: Cleaning up report files older than 10 days.
2016-09-28 04:30:31, Info                  CBS    SQM: Requesting upload of all unsent reports.

Taken from here: https://github.com/logpai/loghub/blob/master/Windows/Windows_2k.log

I want to read this file into a pandas DataFrame with proper column headings. It has multiple delimiters: comma, space and tab. Please share your thoughts with a running code sample.


Solution

  • read_csv allows you to use a regex in sep=, so you can use | (as OR) to define many characters (or even strings) as delimiters.

    Because there are spaces in the text which shouldn't be used as separators, I use , (comma + space) and \s{2,} (2 or more whitespace characters) to detect the correct split points (but this keeps date and time together as a single string):

    sep=r', |\s{2,}', engine='python'
    

    Minimal working code.

    I use io only to create a file-like object in memory so everyone can simply copy and test it - but you could use a filename instead (or even pass the URL of the raw data directly: read_csv("https://raw.githubusercontent.com/logpai/loghub/refs/heads/master/Windows/Windows_2k.log", ...)).

    text = r'''
    2016-09-28 04:30:30, Info                  CBS    Loaded Servicing Stack v6.1.7601.23505 with Core: C:\Windows\winsxs\amd64_microsoft-windows-servicingstack_31bf3856ad364e35_6.1.7601.23505_none_681aa442f6fed7f0\cbscore.dll
    2016-09-28 04:30:31, Info                  CSI    00000001@2016/9/27:20:30:31.455 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fef9fb9b6d @0x7fef9f8358f @0xff83e97c @0xff83d799 @0xff83db2f)
    2016-09-28 04:30:31, Info                  CSI    00000002@2016/9/27:20:30:31.458 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fefa006ade @0x7fef9fd2984 @0x7fef9f83665 @0xff83e97c @0xff83d799)
    2016-09-28 04:30:31, Info                  CSI    00000003@2016/9/27:20:30:31.458 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fefa1c8728 @0x7fefa1c8856 @0xff83e474 @0xff83d7de @0xff83db2f)
    2016-09-28 04:30:31, Info                  CBS    Ending TrustedInstaller initialization.
    2016-09-28 04:30:31, Info                  CBS    Starting the TrustedInstaller main loop.
    2016-09-28 04:30:31, Info                  CBS    TrustedInstaller service starts successfully.
    2016-09-28 04:30:31, Info                  CBS    SQM: Initializing online with Windows opt-in: False
    2016-09-28 04:30:31, Info                  CBS    SQM: Cleaning up report files older than 10 days.
    2016-09-28 04:30:31, Info                  CBS    SQM: Requesting upload of all unsent reports.
    '''
    
    import pandas as pd
    import io
    
    file_like_object = io.StringIO(text)
    
    df = pd.read_csv(file_like_object, sep=r', |\s{2,}', engine='python', names=['date', 'info', 'other', 'text'])
    
    print(df)
    

    Result:

                      date  info other                                               text
    0  2016-09-28 04:30:30  Info   CBS  Loaded Servicing Stack v6.1.7601.23505 with Co...
    1  2016-09-28 04:30:31  Info   CSI  00000001@2016/9/27:20:30:31.455 WcpInitialize ...
    2  2016-09-28 04:30:31  Info   CSI  00000002@2016/9/27:20:30:31.458 WcpInitialize ...
    3  2016-09-28 04:30:31  Info   CSI  00000003@2016/9/27:20:30:31.458 WcpInitialize ...
    4  2016-09-28 04:30:31  Info   CBS            Ending TrustedInstaller initialization.
    5  2016-09-28 04:30:31  Info   CBS           Starting the TrustedInstaller main loop.
    6  2016-09-28 04:30:31  Info   CBS      TrustedInstaller service starts successfully.
    7  2016-09-28 04:30:31  Info   CBS  SQM: Initializing online with Windows opt-in: ...
    8  2016-09-28 04:30:31  Info   CBS  SQM: Cleaning up report files older than 10 days.
    9  2016-09-28 04:30:31  Info   CBS      SQM: Requesting upload of all unsent reports.
    
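    If you need date as a real datetime (the regex above keeps date and time together as a plain string), you can convert the column afterwards with to_datetime:

    df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S')
    
    print(df.dtypes)
    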

    Other idea: pandas has the function read_fwf to read fixed-width formatted data - and you can use the widths of the columns to split it in the correct way.

    It seems to split your data automatically, but you have to remove the , from the date after reading. It also splits date and time into separate columns and treats the date (without the time) as the index.

    # recreate the in-memory file - the previous read_csv already consumed it
    file_like_object = io.StringIO(text)
    
    df = pd.read_fwf(file_like_object, names=['date', 'info', 'other', 'text'])
    
    df['date'] = df['date'].str.rstrip(',')
    
    print(df)
    

    Using widths you can resolve this problem - but you have to calculate all the widths manually. And the , still has to be removed:

    file_like_object = io.StringIO(text)  # recreate the in-memory file again
    
    df = pd.read_fwf(file_like_object, names=['date', 'info', 'other', 'text'], widths=[21, 22, 7, 1000])
    
    df['date'] = df['date'].str.rstrip(',')
    
    print(df)
    

    Probably using colspecs you can define a half-open (start, end) interval for every column and skip the , entirely - but this still requires calculating the start and end values manually.
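    A sketch of how that could look for this sample - the offsets below are counted by hand from the example lines above, so treat them as illustrative rather than exact values for every log:

    file_like_object = io.StringIO(text)
    
    # half-open (start, end) intervals, counted from the sample lines:
    #   (0, 19)    -> date + time (the trailing comma is skipped)
    #   (21, 25)   -> Info
    #   (43, 46)   -> CBS / CSI
    #   (50, 1000) -> the rest of the line
    df = pd.read_fwf(file_like_object,
                     colspecs=[(0, 19), (21, 25), (43, 46), (50, 1000)],
                     names=['date', 'info', 'other', 'text'])
    
    print(df)
    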


    Other idea: sometimes it is simpler to read the file as normal text and use line.split() or slicing line[start:end] (and other string functions, plus if/elif/else) to split the columns yourself, and later use DataFrame(data). This can be useful when the file has more complex data (i.e. entries that span many lines).
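    A rough sketch of that manual approach for this sample, reusing the text variable from above - the 19-character date slice and the split into three remaining parts are assumptions based on this particular format:

    data = []
    for line in io.StringIO(text):
        line = line.strip()
        if not line:
            continue                                      # skip empty lines
        date = line[:19]                                  # '2016-09-28 04:30:30'
        rest = line[19:].lstrip(', ').split(maxsplit=2)   # ['Info', 'CBS', 'rest of the message']
        data.append([date] + rest)
    
    df = pd.DataFrame(data, columns=['date', 'info', 'other', 'text'])
    
    print(df)
    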


    Doc: read_csv, read_fwf