Search code examples
pythonpandaswindowscsvparse-error

ParserError: Expected 2 fields in line 32, saw 4


I'm having trouble parsing a txt file (see here: File) Here's my code

import pandas as pd
objectname = r"path"
df = pd.read_csv(objectname, engine = 'python',  sep='\t', header=None)

Unfortunately it does not work. Since this question has been asked several times, I tried lots of proposed solutions (most of them can be found here: Possible solutions)

However, nothing did the trick for me. For instance, when I use

sep='delimiter'

The dataframe is created but everything ends up in a single column.

When I use

error_bad_lines=False

The rows I'm interested in are simply skipped.

The only way it works is when I first open the txt file, copy the content, paste it into google sheets, save the file as CSV and then open the dataframe.

I guess another workaround would be to use

df = pd.read_csv(objectname, engine = 'python',  sep = 'delimiter', header=None)

in combination with the split function Split function

Is there any suggestion how to make this work without the need to convert the file or to use the split function? I'm using Python 3 and Windows 10. Any help is appreciated.


Solution

  • Your file has tab separators but is not a TSV. The file is a mixture of metadata, followed by a "standard" TSV, followed by more metadata. Therefore, I found tackling the metadata as a separate task from loading the data to be useful.

    Here's what I did to extract the metadata lines:

    with open('example.txt','r') as file_handle:
        file_content = file_handle.read().split('\n')
    
    for index, line in enumerate(file_content):
        if index<21 or index>37:
            print(index, line.split('\t'))
    

    Note that the lines denoting the start and stop of metadata (21 and 37 in my example) are specific to the file. I've provided the trimmed data I used below (based on your linked file).

    Separately, I loaded the TSV into Pandas using

    import pandas as pd
    df = pd.read_csv('example.txt', engine = 'python',  
                     sep='\t', error_bad_lines=False, header=None,
                     skiprows=list(range(21))+list(range(37,89)))
    

    Again, I skipped the metadata at the start of the file and at the end of the file.

    Here's the file I experimented with. I've trimmed the extra data to reduce line count.

    TITLE   Test123
    DATA TYPE   
    ORIGIN  JASCO
    OWNER   
    DATE    19/03/28
    TIME    16:39:44
    SPECTROMETER/DATA SYSTEM    
    LOCALE  1031
    RESOLUTION  
    DELTAX  -0,5
    XUNITS  NANOMETERS
    YUNITS  CD [mdeg]
    Y2UNITS HT [V]
    Y3UNITS ABSORBANCE
    FIRSTX    300,0000
    LASTX     190,0000
    NPOINTS      221
    FIRSTY      -0,78961
    MAXY        37,26262
    MINY       -53,38971
    XYDATA
    300,0000    -0,789606   182,198 -0,0205245
    299,5000    -0,691644   182,461 -0,0181217
    299,0000    -0,700976   182,801 -0,0136756
    298,5000    -0,614708   182,799 -0,0131957
    298,0000    -0,422611   182,783 -0,0130073
    195,0000    26,6231 997,498 4,7258
    194,5000    -17,3049    997,574 4,6864
    194,0000    16,0387 997,765 4,63967
    193,5000    -14,4049    997,967 4,58593
    193,0000    -0,277261   998,025 4,52411
    192,5000    -29,6098    998,047 4,45244
    192,0000    -11,5786    998,097 4,36608
    191,5000    34,0505 998,282 4,27376
    191,0000    28,2325 998,314 4,1701
    190,5000    -13,232 998,336 4,05036
    190,0000    -47,023 998,419 3,91883
    
    ##### Extended Information
    [Comments]
    Sample name X
    Comment
    User
    Division
    Company RWTH Aachen
    
    [Detailed Information]
    Creation date   28.03.2019 16:39
    
    Data array type Linear data array * 3
    Horizontal axis Wavelength [nm]
    Vertical axis(1)    CD [mdeg]
    Vertical axis(2)    HT [V]
    Vertical axis(3)    Abs
    Start   300 nm
    End 190 nm
    Data interval   0,5 nm
    Data points 221     
    
    [Measurement Information]
    Instrument name CD-Photometer
    Model name  J-1100
    Serial No.  A001361635
    
    Detector    Standard PMT
      Lock-in amp.  X mode
      HT volt   Auto
    
    Accessory   PTC-514
    Accessory S/N   A000161648
      Temperature   18.63 C
      Control sonsor    Holder
      Monitor sensor    Holder
    
    Measurement date    28.03.2019 16:39
    
    Overload detect 203
    Photometric mode    CD, HT, Abs
    Measure range   300 - 190 nm
    Data pitch  0.5 nm
    CD scale    2000 mdeg/1.0 dOD
    FL scale    200 mdeg/1.0 dOD
    D.I.T.  0.5 sec
    Bandwidth   1.00 nm
    Start mode  Immediately
    Scanning speed  200 nm/min
    Baseline correction Baseline
    Shutter control Auto
    Accumulations   3