
How to read a log file with multiple delimiters into a pandas DataFrame?


My log file format is as below:

2016-09-28 04:30:30, Info                  CBS    Loaded Servicing Stack v6.1.7601.23505 with Core: C:\Windows\winsxs\amd64_microsoft-windows-servicingstack_31bf3856ad364e35_6.1.7601.23505_none_681aa442f6fed7f0\cbscore.dll
2016-09-28 04:30:31, Info                  CSI    00000001@2016/9/27:20:30:31.455 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fef9fb9b6d @0x7fef9f8358f @0xff83e97c @0xff83d799 @0xff83db2f)
2016-09-28 04:30:31, Info                  CSI    00000002@2016/9/27:20:30:31.458 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fefa006ade @0x7fef9fd2984 @0x7fef9f83665 @0xff83e97c @0xff83d799)
2016-09-28 04:30:31, Info                  CSI    00000003@2016/9/27:20:30:31.458 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fefa1c8728 @0x7fefa1c8856 @0xff83e474 @0xff83d7de @0xff83db2f)
2016-09-28 04:30:31, Info                  CBS    Ending TrustedInstaller initialization.
2016-09-28 04:30:31, Info                  CBS    Starting the TrustedInstaller main loop.
2016-09-28 04:30:31, Info                  CBS    TrustedInstaller service starts successfully.
2016-09-28 04:30:31, Info                  CBS    SQM: Initializing online with Windows opt-in: False
2016-09-28 04:30:31, Info                  CBS    SQM: Cleaning up report files older than 10 days.
2016-09-28 04:30:31, Info                  CBS    SQM: Requesting upload of all unsent reports.

Taken from here: https://github.com/logpai/loghub/blob/master/Windows/Windows_2k.log

I want to read this file into a pandas DataFrame with proper column headings. It has multiple delimiters: comma, space and tab. Please share your thoughts with a running code sample.


Solution

  • read_csv allows you to use a regex in sep=, so you can use | (as OR) to define many characters (or even strings) as delimiters.

    Because there are spaces in the text which shouldn't be used as separators, I use , (comma + space) and \s{2,} (2 or more whitespace characters) to detect the correct split points (but this keeps date and time together as a single string):

    sep=r', |\s{2,}', engine='python'
    

    Minimal working code.

    I use io only to create a file-like object in memory so everyone can simply copy and test it - but you could use a filename instead (or even pass the URL of the raw data directly: read_csv("https://raw.githubusercontent.com/logpai/loghub/refs/heads/master/Windows/Windows_2k.log", ...)).

    text = r'''
    2016-09-28 04:30:30, Info                  CBS    Loaded Servicing Stack v6.1.7601.23505 with Core: C:\Windows\winsxs\amd64_microsoft-windows-servicingstack_31bf3856ad364e35_6.1.7601.23505_none_681aa442f6fed7f0\cbscore.dll
    2016-09-28 04:30:31, Info                  CSI    00000001@2016/9/27:20:30:31.455 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fef9fb9b6d @0x7fef9f8358f @0xff83e97c @0xff83d799 @0xff83db2f)
    2016-09-28 04:30:31, Info                  CSI    00000002@2016/9/27:20:30:31.458 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fefa006ade @0x7fef9fd2984 @0x7fef9f83665 @0xff83e97c @0xff83d799)
    2016-09-28 04:30:31, Info                  CSI    00000003@2016/9/27:20:30:31.458 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fefa1c8728 @0x7fefa1c8856 @0xff83e474 @0xff83d7de @0xff83db2f)
    2016-09-28 04:30:31, Info                  CBS    Ending TrustedInstaller initialization.
    2016-09-28 04:30:31, Info                  CBS    Starting the TrustedInstaller main loop.
    2016-09-28 04:30:31, Info                  CBS    TrustedInstaller service starts successfully.
    2016-09-28 04:30:31, Info                  CBS    SQM: Initializing online with Windows opt-in: False
    2016-09-28 04:30:31, Info                  CBS    SQM: Cleaning up report files older than 10 days.
    2016-09-28 04:30:31, Info                  CBS    SQM: Requesting upload of all unsent reports.
    '''
    
    import pandas as pd
    import io
    
    file_like_object = io.StringIO(text)
    
    df = pd.read_csv(file_like_object, sep=r', |\s{2,}', engine='python', names=['date', 'info', 'other', 'text'])
    
    print(df)
    

    Result:

                      date  info other                                               text
    0  2016-09-28 04:30:30  Info   CBS  Loaded Servicing Stack v6.1.7601.23505 with Co...
    1  2016-09-28 04:30:31  Info   CSI  00000001@2016/9/27:20:30:31.455 WcpInitialize ...
    2  2016-09-28 04:30:31  Info   CSI  00000002@2016/9/27:20:30:31.458 WcpInitialize ...
    3  2016-09-28 04:30:31  Info   CSI  00000003@2016/9/27:20:30:31.458 WcpInitialize ...
    4  2016-09-28 04:30:31  Info   CBS            Ending TrustedInstaller initialization.
    5  2016-09-28 04:30:31  Info   CBS           Starting the TrustedInstaller main loop.
    6  2016-09-28 04:30:31  Info   CBS      TrustedInstaller service starts successfully.
    7  2016-09-28 04:30:31  Info   CBS  SQM: Initializing online with Windows opt-in: ...
    8  2016-09-28 04:30:31  Info   CBS  SQM: Cleaning up report files older than 10 days.
    9  2016-09-28 04:30:31  Info   CBS      SQM: Requesting upload of all unsent reports.
    
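    If you need date as a real datetime (the regex above keeps date and time together as a plain string), you can convert the column afterwards with to_datetime:

    df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S')
    
    print(df.dtypes)
    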

    Other idea: pandas has the function read_fwf to read fixed-width formatted data - and you can use the widths of the columns to split it in the correct way.

    It seems to split your data automatically, but you have to remove the , from the date after reading. It also splits date and time into separate columns and treats the date (without the time) as the index.

    # recreate the in-memory file - the previous read_csv already consumed it
    file_like_object = io.StringIO(text)
    
    df = pd.read_fwf(file_like_object, names=['date', 'info', 'other', 'text'])
    
    df['date'] = df['date'].str.rstrip(',')
    
    print(df)
    

    Using widths you can resolve this problem - but you have to calculate all the widths manually. And the , still has to be removed:

    file_like_object = io.StringIO(text)  # recreate the in-memory file again
    
    df = pd.read_fwf(file_like_object, names=['date', 'info', 'other', 'text'], widths=[21, 22, 7, 1000])
    
    df['date'] = df['date'].str.rstrip(',')
    
    print(df)
    

    Probably using colspecs you can define a half-open (start, end) interval for every column and skip the , entirely - but this still requires calculating the start and end values manually.
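    A sketch of how that could look for this sample - the offsets below are counted by hand from the example lines above, so treat them as illustrative rather than exact values for every log:

    file_like_object = io.StringIO(text)
    
    # half-open (start, end) intervals, counted from the sample lines:
    #   (0, 19)    -> date + time (the trailing comma is skipped)
    #   (21, 25)   -> Info
    #   (43, 46)   -> CBS / CSI
    #   (50, 1000) -> the rest of the line
    df = pd.read_fwf(file_like_object,
                     colspecs=[(0, 19), (21, 25), (43, 46), (50, 1000)],
                     names=['date', 'info', 'other', 'text'])
    
    print(df)
    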


    Other idea: sometimes it is simpler to read the file as normal text and use line.split() or slicing line[start:end] (and other string functions, plus if/elif/else) to split the columns yourself, and later use DataFrame(data). This can be useful when the file has more complex data (i.e. entries that span many lines).
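    A rough sketch of that manual approach for this sample, reusing the text variable from above - the 19-character date slice and the split into three remaining parts are assumptions based on this particular format:

    data = []
    for line in io.StringIO(text):
        line = line.strip()
        if not line:
            continue                                      # skip empty lines
        date = line[:19]                                  # '2016-09-28 04:30:30'
        rest = line[19:].lstrip(', ').split(maxsplit=2)   # ['Info', 'CBS', 'rest of the message']
        data.append([date] + rest)
    
    df = pd.DataFrame(data, columns=['date', 'info', 'other', 'text'])
    
    print(df)
    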


    Doc: read_csv, read_fwf