My log file format is as below :
2016-09-28 04:30:30, Info CBS Loaded Servicing Stack v6.1.7601.23505 with Core: C:\Windows\winsxs\amd64_microsoft-windows-servicingstack_31bf3856ad364e35_6.1.7601.23505_none_681aa442f6fed7f0\cbscore.dll
2016-09-28 04:30:31, Info CSI 00000001@2016/9/27:20:30:31.455 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fef9fb9b6d @0x7fef9f8358f @0xff83e97c @0xff83d799 @0xff83db2f)
2016-09-28 04:30:31, Info CSI 00000002@2016/9/27:20:30:31.458 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fefa006ade @0x7fef9fd2984 @0x7fef9f83665 @0xff83e97c @0xff83d799)
2016-09-28 04:30:31, Info CSI 00000003@2016/9/27:20:30:31.458 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fefa1c8728 @0x7fefa1c8856 @0xff83e474 @0xff83d7de @0xff83db2f)
2016-09-28 04:30:31, Info CBS Ending TrustedInstaller initialization.
2016-09-28 04:30:31, Info CBS Starting the TrustedInstaller main loop.
2016-09-28 04:30:31, Info CBS TrustedInstaller service starts successfully.
2016-09-28 04:30:31, Info CBS SQM: Initializing online with Windows opt-in: False
2016-09-28 04:30:31, Info CBS SQM: Cleaning up report files older than 10 days.
2016-09-28 04:30:31, Info CBS SQM: Requesting upload of all unsent reports.
Taken from here https://github.com/logpai/loghub/blob/master/Windows/Windows_2k.log
I want to read this file in a pandas dataframe with proper column headings. It is having multiple delimeters as ,
space
and tab
.
Please share your thoughts with a running code sample.
read_csv
allows to use regex
in sep=
so you can use |
(as OR
) to define many chars (or even strings) as delimiters.
Because there are spaces in text which shouldn't be used as separator so I use ,
(comma + space) and \s{2,}
(2 or more whitespaces) to detect correct places (but this keeps date and time as single string):
sep=r', |\s{2,}', engine='python'
Minimal working code.
I use io
only to create file like object
in memory so everyone can simply copy and test it - but you could use filename (or even directly url to raw data read_csv("https://raw.githubusercontent.com/logpai/loghub/refs/heads/master/Windows/Windows_2k.log", ...)
)
text = r'''
2016-09-28 04:30:30, Info CBS Loaded Servicing Stack v6.1.7601.23505 with Core: C:\Windows\winsxs\amd64_microsoft-windows-servicingstack_31bf3856ad364e35_6.1.7601.23505_none_681aa442f6fed7f0\cbscore.dll
2016-09-28 04:30:31, Info CSI 00000001@2016/9/27:20:30:31.455 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fef9fb9b6d @0x7fef9f8358f @0xff83e97c @0xff83d799 @0xff83db2f)
2016-09-28 04:30:31, Info CSI 00000002@2016/9/27:20:30:31.458 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fefa006ade @0x7fef9fd2984 @0x7fef9f83665 @0xff83e97c @0xff83d799)
2016-09-28 04:30:31, Info CSI 00000003@2016/9/27:20:30:31.458 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fefa1c8728 @0x7fefa1c8856 @0xff83e474 @0xff83d7de @0xff83db2f)
2016-09-28 04:30:31, Info CBS Ending TrustedInstaller initialization.
2016-09-28 04:30:31, Info CBS Starting the TrustedInstaller main loop.
2016-09-28 04:30:31, Info CBS TrustedInstaller service starts successfully.
2016-09-28 04:30:31, Info CBS SQM: Initializing online with Windows opt-in: False
2016-09-28 04:30:31, Info CBS SQM: Cleaning up report files older than 10 days.
2016-09-28 04:30:31, Info CBS SQM: Requesting upload of all unsent reports.
'''
import pandas as pd
import io
file_like_object = io.StringIO(text)
df = pd.read_csv(file_like_object, sep=r', |\s{2,}', engine='python', names=['date', 'info', 'other', 'text'])
print(df)
Result:
date info other text
0 2016-09-28 04:30:30 Info CBS Loaded Servicing Stack v6.1.7601.23505 with Co...
1 2016-09-28 04:30:31 Info CSI 00000001@2016/9/27:20:30:31.455 WcpInitialize ...
2 2016-09-28 04:30:31 Info CSI 00000002@2016/9/27:20:30:31.458 WcpInitialize ...
3 2016-09-28 04:30:31 Info CSI 00000003@2016/9/27:20:30:31.458 WcpInitialize ...
4 2016-09-28 04:30:31 Info CBS Ending TrustedInstaller initialization.
5 2016-09-28 04:30:31 Info CBS Starting the TrustedInstaller main loop.
6 2016-09-28 04:30:31 Info CBS TrustedInstaller service starts successfully.
7 2016-09-28 04:30:31 Info CBS SQM: Initializing online with Windows opt-in: ...
8 2016-09-28 04:30:31 Info CBS SQM: Cleaning up report files older than 10 days.
9 2016-09-28 04:30:31 Info CBS SQM: Requesting upload of all unsent reports.
Other idea: pandas has function read_fwf
to read fixed-width formatted data - and you can use width of columns to split it in correct way.
It seems it automatically splits your data but it needs to remove ,
from date after reading. But it also splits date and time as separate columns. It treats date (without time) as index.
df = pd.read_fwf(file_like_object, names=['date', 'info', 'other', 'text'])
df['date'] = df['date'].str.rstrip(',')
print(df)
Using widths
you can resolve this problem - but this needs to calculate all widths manually. And this still needs to remove ,
df = pd.read_fwf(file_like_object, names=['date', 'info', 'other', 'text'], widths=[21,22,7,1000])
df['date'] = df['date'].str.rstrip(',')
Probably using colspecs
you can define [start,end[
for every column and skip ,
but this still need to calculate values start
,end
manually.
Other idea: sometimes it is simpler to read it as normal text and use line.split()
or slicing line[start:end]
(and other string functions, and also if/elif/else
) to split columns and later use DataFrame(data)
. It can be useful when file has more complex data (ie. text in many lines)