I am trying to read a *.dat file with pandas' read_csv function.
df = pd.read_csv(file, skiprows=0, header=None, sep=" ", parse_dates=[[0, 1]])
The data looks like this:
2019-06-01 04:00:22 PW 100 2000 2000 /// // // // ////// ////// ////
2019-06-01 04:00:32 PW 100 2000 2000 /// // // // ////// ////// ////
2019-06-01 04:00:42 PW 100 2000 2000 /// // // // ////// ////// ////
2019-06-01 04:00:52 PW 100 2000 2000 /// // // // ////// ////// ////
2019-06-01 04:01:02 PW 100 2000 2000 /// // // // ////// ////// ////
2019-06-01 04:01:12 PW 100 2000 2000 /// // // // ////// ////// ////
2019-06-01 04:01:22 PW 100 2000 2000 /// // // // ////// ////// ////
2019-06-01 04:01:32 PW 100 2000 2000 /// // // // ////// ////// ////
I get a tokenizing error:
ParserError: Error tokenizing data. C error: Expected 16 fields in line 242, saw 17
I think this error occurs because of the value in column 6 of line 242. In the lines before, column 6 always has four digits (e.g. it stays at 2000, or holds values like 1501), but in line 242 it drops to 991, which has only three digits.
2019-06-01 04:39:32 PW 100 2000 2000 /// // // // ////// ////// ////
2019-06-01 04:39:42 PW 100 1501 2000 /// // // // ////// ////// ////
2019-06-01 04:39:52 PW 100 1501 2000 /// // // // ////// ////// ////
2019-06-01 04:40:02 PW 100 1501 2000 /// // // // ////// ////// ////
2019-06-01 04:40:12 PW 100 1187 2000 /// // // // ////// ////// ////
2019-06-01 04:40:22 PW 100 1187 2000 /// // // // ////// ////// ////
2019-06-01 04:40:32 PW 100 991 2000 /// // // // ////// ////// ////
How can I get rid of this error?
error_bad_lines=False is not an option, because I need exactly these rows.
You should use sep=" +" or sep="\s+" instead of sep=" ". With a single-space separator, every run of consecutive blanks is split into multiple empty columns, so the field count changes whenever the whitespace padding changes, and that is exactly what triggers the error: the shorter value 991 gets an extra padding space, which pandas reads as an additional (17th) field.
As an alternative, you could specify delim_whitespace=True instead of sep or delimiter (note that delim_whitespace has been deprecated in recent pandas versions in favor of sep="\s+").
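A minimal sketch of the fix, using a few rows from your sample as an in-memory stand-in for the *.dat file (the extra space before 991 reproduces the padding problem); the parse_dates=[[0, 1]] part is taken from your original call:

```python
import io
import pandas as pd

# In-memory stand-in for the *.dat file; note the double space before 991.
data = """\
2019-06-01 04:40:12 PW 100 1187 2000 /// // // // ////// ////// ////
2019-06-01 04:40:22 PW 100 1187 2000 /// // // // ////// ////// ////
2019-06-01 04:40:32 PW 100  991 2000 /// // // // ////// ////// ////
"""

# sep=r"\s+" treats any run of whitespace as a single delimiter, so the
# extra padding space no longer produces a phantom empty column.
df = pd.read_csv(io.StringIO(data), header=None, sep=r"\s+", parse_dates=[[0, 1]])

print(df.shape)  # all three rows parse to the same number of columns
```

With sep=" " instead of sep=r"\s+", the same snippet raises the ParserError from the question, because the third row tokenizes into one extra field.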