python, pandas, file-processing

Python pandas misses some end of line characters while loading csv file


I have a csv file (tab separated) written in German. I did not create the file. I tried to read it with Python's pandas package. I do the following:

import pandas as pd
trn_file ="data/train.csv"
pd_train = pd.read_csv(trn_file,delimiter='\t',encoding='utf-8',header=None)
# pd_train is [1153 rows x 12 columns]
# the first  couple of rows of pd_train can be seen below:
>>> pd_train
        0                                                  1                                     2    3           4   5   6                                                7                                                8                      9     10    11
0       35  Auch in Großbritannien, wo 19 Atomreaktoren in...                              Ausstieg -1.0  2011-03-13  10  10                                     Sunday Times                                     Sunday Times           Sunday Times   NaN     1
1      117  Deswegen sollte Deutschland nicht für [...] we...                              Ausstieg  1.0  2011-04-11  60  62                                 Dietram Hoffmann                                 Dietram Hoffmann                    NaN   NaN   121

When I investigated the dataframe, I realized that the file is not parsed properly. Some lines appear merged into a single cell even though there is a newline character between them. For example, the cell below should hold one sentence, but it actually contains the text of 4 lines of the file. (They should have been in separate rows in the dataframe):

>>> pd_train[1][483]
'Wer keine Brücke will, kann auch keine Brückenmaut verlangen. Eine Klage gegen die Kernbrennstoffsteuer schließe ich nicht aus.\tKonsens/Einigkeit\t-1.0\t2011-05-03\t90\t91\tEon\tJohannes Teyssen\tEon\t\t558\n3\tEin solches schicksalhaftes Langzeitprojekt ist für einen kurzsichtigen Profilierungswettstreit der Parteien ungeeignet. Deshalb müssen wir einen Konsens finden, der von einer breiten Mehrheit auf Dauer getragen wird.\tKonsens/Einigkeit\t1.0\t2011-05-10\t50\t55\tAlois Glück\tAlois Glück\tZentralkomitee der Katholiken\t31.0\t576\n1459\tWir brauchen jetzt keine Kommissionen, sondern einen neuen, breiten Konsens, der dann wirklich hält.\tKonsens/Einigkeit\t1.0\t2011-04-12\t30\t30\tClaudia Roth\tClaudia Roth\tGrüne\t34.0\t671\n1745\tDie Parteispitze zeigt sich offen für einen Konsens. Das würde die Richtigkeit des Atomausstiegs und des grünen Kurses besiegeln", sagt Steffi Lemke, politische Geschäftsführerin der Grünen.'

How can I fix this problem?

Please let me know if I need to provide further information.

EDIT: I tried @abby's suggestions. When I gave the full path, nothing changed. When I removed the delimiter and encoding parameters, I got the following error:

pd.read_csv(trn_file,header=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 446, in _read
    data = parser.read(nrows)
  File "anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 891, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 11 fields in line 14, saw 12

Solution

  • The problem is that some text entries contain quote characters. These mask the delimiters and line feeds that follow them, so several lines end up in one field. By specifying quoting=csv.QUOTE_NONE you can switch off this special treatment of quote characters. So use

    import csv  # needed for the csv.QUOTE_NONE constant
    pd_train = pd.read_csv(trn_file, delimiter='\t', encoding='utf-8', header=None, quoting=csv.QUOTE_NONE)
    

    to read files with occasional quoting characters. See https://docs.python.org/3/library/csv.html:

    csv.QUOTE_NONE

    Instructs writer objects to never quote fields. When the current delimiter occurs in output data it is preceded by the current escapechar character. If escapechar is not set, the writer will raise Error if any characters that require escaping are encountered.

    Instructs reader to perform no special processing of quote characters.
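
To see the effect in isolation, here is a minimal sketch that does not need the original file. The three-line sample is made up for illustration (it is not taken from train.csv), and it assumes the default C parser of pandas. Two of its text fields begin with a double quote, which seems to be the situation in the question.

```python
import csv
import io

import pandas as pd

# Hypothetical tab-separated sample, not taken from the real file.
# The text fields on lines 1 and 3 begin with a double quote.
sample = (
    '1\t"An opening quote that is never closed.\tA\n'
    '2\tAn ordinary sentence.\tB\n'
    '3\t"A quoted start", said someone.\tC\n'
)

# Default quoting: a quote at the start of a field opens a quoted field,
# so the tabs and newlines after it are read as part of that field.
merged = pd.read_csv(io.StringIO(sample), delimiter="\t", header=None)

# QUOTE_NONE: quote characters are treated as ordinary text,
# so every physical line becomes its own row.
clean = pd.read_csv(
    io.StringIO(sample),
    delimiter="\t",
    header=None,
    quoting=csv.QUOTE_NONE,
)

print(merged.shape)             # fewer rows than lines in the sample
print(clean.shape)              # one row per line
print(repr(merged.iloc[0, 1]))  # the merged cell, with embedded \t and \n
```

With QUOTE_NONE the quote characters stay in the text as literal characters, so you may want to strip them afterwards if they are not wanted.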