Search code examples
pythonpandasimportdelimiterseparator

Unable to separate data with ',' separator while importing text file


My data is having a repetitive pattern:


2021-11-17 10:59:10.880
SysState: 4, Events: 161403, EMS: 4, VDB: 2, TubeState: 0x02
BDR Mode: 1, BMS Ext: 2, BMS Int: 0, BdrStat: 00
CPU(%):16, CPUmax(%):47, task idx:3, CPUmaxIRQ(%):0
SOC:9973, SOH:100, LV?:0, HV?:0
3330mV 3333mV 3332mV 3332mV 3331mV 0 0
3331mV 3324mV 3325mV 3325mV 3328mV 0 0
3325mV 3321mV 3328mV 3328mV 3327mV 0 0
3329mV 0mV 0mV 0mV 0mV 0 0
BPV:53288, PLV:53241, BPC:0, PLC:0
AMBI:421, CONN:278, FETS:282, BMSC:274, BPA1:259, BPA2:237, BPA3:255
2021-11-17 10:59:13.80
SysState: 4, Events: 161407, EMS: 4, VDB: 3, TubeState: 0x08
BDR Mode: 4, BMS Ext: 3, BMS Int: 1, BdrStat: 00
CPU(%):12, CPUmax(%):47, task idx:3, CPUmaxIRQ(%):0
SOC:9973, SOH:100, LV?:0, HV?:0
3332mV 3331mV 3332mV 3332mV 3331mV 0 0
3331mV 3324mV 3325mV 3326mV 3328mV 0 0
3324mV 3321mV 3328mV 3328mV 3327mV 0 0
3329mV 0mV 0mV 0mV 0mV 0 0
BPV:53288, PLV:53277, BPC:23, PLC:0
AMBI:421, CONN:278, FETS:282, BMSC:276, BPA1:259, BPA2:237, BPA3:255
2021-11-17 10:59:15.280
SysState: 4, Events: 161407, EMS: 4, VDB: 3, TubeState: 0x08
BDR Mode: 4, BMS Ext: 3, BMS Int: 1, BdrStat: 00
CPU(%):11, CPUmax(%):47, task idx:3, CPUmaxIRQ(%):0
SOC:9973, SOH:100, LV?:0, HV?:0
3331mV 3332mV 3331mV 3332mV 3331mV 0 0
3331mV 3324mV 3325mV 3325mV 3328mV 0 0
3324mV 3322mV 3328mV 3328mV 3327mV 0 0
3331mV 0mV 0mV 0mV 0mV 0 0
BPV:53288, PLV:53259, BPC:47, PLC:47
AMBI:421, CONN:278, FETS:282, BMSC:276, BPA1:259, BPA2:237, BPA3:255




What I want to do is separate every value and make it a column starting from '2021-11-17 10:59:10.880' to 'BPA3:255'

Index Another header Another header Another header
0 2021-11-17 10:59:10.880 SysState: 4 Events: 161403
1 2021-11-17 10:59:13.80 SysState: 4 Events: 1161407

so on and so forth..

what is have done so far:


The file was a .txt file and I converted it into csv first and then:

df = pd.read_csv('data.csv', sep=',' )


but it gives me ParserError: Error tokenizing data. Anybody knows how to solve this problem? with the sep= ';' or changing the text file to csv gives me the following output:

enter image description here

Is there a way to resolve this while parsing text file and not converting it into csv?


Solution

  • Assuming you want only the date, the SysState and Events, an easy way is to extract the info using a regex.

    I also assumed the file is not huge so I am loading all in memory, if this is not the case then you'll have to parse line by line.

    with open('filename.csv') as f:
        lines = f.read()
    
    import re
    regex = re.compile('(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d+)\nSysState: (\d+),\s+Events: (\d+).+?')
    
    df = pd.DataFrame(regex.findall(lines), columns=['datetime', 'SysState', 'Events'])
    

    NB. I extracted only the numbers from the fields, but if you really want to have SysState: 4, etc. it is easy to add it in the capturing group

    output:

                      datetime SysState  Events
    0  2021-11-17 10:59:10.880        4  161403
    1   2021-11-17 10:59:13.80        4  161407
    2  2021-11-17 10:59:15.280        4  161407