
Strange Python csv module behavior: some records are not split


I'm trying to parse cities5000.txt from geonames.org (http://download.geonames.org/export/dump/cities5000.zip) with Python's csv module and getting very strange behavior: csv doesn't split all the lines in the file into separate records.

For example:

>>> import csv
>>> len(open('cities5000.txt').read().splitlines())
46955
>>> len(list(csv.reader(open('cities5000.txt'))))
46955
>>> # but here comes the fun part
>>> len(list(csv.reader(open('cities5000.txt'), delimiter='\t')))
46048

'\t' is the actual delimiter used in this file, so about 900 records are being absorbed into the fields of other records. Everything else in the parsed data looks fine.

The question is: what causes this, and how can I avoid it without splitting all these records manually?


Solution

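  • The default dialect also specifies a quote character ("), and a quoted field is allowed to contain newlines. When a field in the data starts with a stray quote, the reader keeps consuming the following lines as part of that field until it finds a closing quote, which is why records get merged. You can disable this with quotechar=None (or, more explicitly, quoting=csv.QUOTE_NONE).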

    >>> len(open('cities5000.txt').read().splitlines())
    46957
    >>> len(list(csv.reader(open('cities5000.txt'), delimiter='\t')))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    _csv.Error: field larger than field limit (131072)
    >>> len(list(csv.reader(open('cities5000.txt'), delimiter='\t', quotechar=None)))
    46957
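
The effect can be reproduced without the GeoNames file. The following is a minimal sketch that uses a made-up three-line, tab-separated sample (the values are hypothetical, not taken from cities5000.txt) in which one field starts with a double quote and a later field ends with one. It uses quoting=csv.QUOTE_NONE, which tells the reader explicitly to treat quote characters as ordinary data.

```python
import csv
import io

# Hypothetical tab-separated sample, not real GeoNames data.
# The second field of line 1 starts with a quote; the matching
# quote only appears at the end of a field on line 3.
sample = '1\t"Foo\tx\n2\tBar\ty\n3\tBaz"\tz\n'

# Default dialect: the opening quote starts a quoted field, so the
# tabs and newlines up to the closing quote become part of one field.
default_rows = list(csv.reader(io.StringIO(sample), delimiter='\t'))

# QUOTE_NONE: quote characters are plain data, one record per line.
raw_rows = list(
    csv.reader(io.StringIO(sample), delimiter='\t', quoting=csv.QUOTE_NONE)
)

print(len(default_rows))  # 1
print(len(raw_rows))      # 3
print(raw_rows[0])        # ['1', '"Foo', 'x']
```

With the default dialect all three lines collapse into a single record whose middle field contains embedded tabs and newlines. With quoting disabled, the record count matches the line count again and the quote characters are kept in the field values.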