
Python Pandas Error tokenizing data: How to avoid error caused by different length


I am trying to read a *.dat file with pandas read_csv function.

df = pd.read_csv(file, skiprows=0, header=None, sep=" ", parse_dates=[[0, 1]])

The data looks like this:

2019-06-01 04:00:22 PW  100  2000  2000 /// // // // ////// ////// ////
2019-06-01 04:00:32 PW  100  2000  2000 /// // // // ////// ////// ////
2019-06-01 04:00:42 PW  100  2000  2000 /// // // // ////// ////// ////
2019-06-01 04:00:52 PW  100  2000  2000 /// // // // ////// ////// ////
2019-06-01 04:01:02 PW  100  2000  2000 /// // // // ////// ////// ////
2019-06-01 04:01:12 PW  100  2000  2000 /// // // // ////// ////// ////
2019-06-01 04:01:22 PW  100  2000  2000 /// // // // ////// ////// ////
2019-06-01 04:01:32 PW  100  2000  2000 /// // // // ////// ////// ////

I get a Tokenizing Error:

ParserError: Error tokenizing data. C error: Expected 16 fields in line 242, saw 17

I think the error occurs because the value in column 6 gets shorter at line 242. In the lines before, it is a four-digit number (2000, or e.g. 1501), but at line 242 it drops to the three-digit 991. The columns are right-aligned, so the shorter number is padded with an extra blank, and that extra blank produces an extra field.

2019-06-01 04:39:32 PW  100  2000  2000 /// // // // ////// ////// ////
2019-06-01 04:39:42 PW  100  1501  2000 /// // // // ////// ////// ////
2019-06-01 04:39:52 PW  100  1501  2000 /// // // // ////// ////// ////
2019-06-01 04:40:02 PW  100  1501  2000 /// // // // ////// ////// ////
2019-06-01 04:40:12 PW  100  1187  2000 /// // // // ////// ////// ////
2019-06-01 04:40:22 PW  100  1187  2000 /// // // // ////// ////// ////
2019-06-01 04:40:32 PW  100   991  2000 /// // // // ////// ////// ////
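A quick way to see the effect (this snippet is only illustrative, not part of the original question): Python's str.split(" ") behaves like pandas with sep=" " and keeps an empty field for every extra consecutive blank, so the padded three-digit value yields one field more.

```python
# With a single-space separator, every extra blank adds an empty field.
wide = "100  1187  2000".split(" ")    # two blanks between values
narrow = "100   991  2000".split(" ")  # three blanks before the shorter 991

print(wide)    # ['100', '', '1187', '', '2000']        -> 5 fields
print(narrow)  # ['100', '', '', '991', '', '2000']     -> 6 fields
```

This is exactly the "Expected 16 fields ... saw 17" situation from the traceback, just in miniature.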

How can I get rid of this error?

error_bad_lines=False is not an option, because I need exactly these values


Solution

  • You should use sep=" +" or sep=r"\s+" instead of sep=" ". With sep=" ", every run of consecutive blanks is split into multiple fields (with empty values in between), so the field count changes whenever the number of blanks changes, and that is exactly what causes the error. A regex separator treats any run of whitespace as a single delimiter.

    As an alternative, you could specify delim_whitespace=True instead of sep or delimiter. (Note that delim_whitespace has been deprecated in recent pandas versions in favor of sep=r"\s+".)
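    Putting it together, a minimal sketch (using an in-memory sample of the lines from the question rather than the original *.dat file):

    ```python
    import io
    import pandas as pd

    # Two sample rows from the question; note the extra blank that pads
    # the three-digit 991, which broke parsing with sep=" ".
    data = io.StringIO(
        "2019-06-01 04:40:22 PW  100  1187  2000 /// // // // ////// ////// ////\n"
        "2019-06-01 04:40:32 PW  100   991  2000 /// // // // ////// ////// ////\n"
    )

    # sep=r"\s+" collapses any run of whitespace into a single delimiter,
    # so both rows parse into the same 13 fields.
    df = pd.read_csv(data, header=None, sep=r"\s+")
    print(df.shape)  # (2, 13)
    ```

    With the original file you would keep the parse_dates=[[0, 1]] argument (or combine columns 0 and 1 with pd.to_datetime afterwards) to merge the date and time columns.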