I'm trying to parse cities5000.txt
from geonames.org (http://download.geonames.org/export/dump/cities5000.zip) with Python's csv module and getting very strange behavior: csv
doesn't split all the lines in the file.
For example:
>>> len(open('cities5000.txt').read().splitlines())
46955
>>> len(list(csv.reader(open('cities5000.txt'))))
46955
# but here comes some fun
>>> len(list(csv.reader(open('cities5000.txt'), delimiter='\t')))
46048
and '\t' is the actual delimiter used in this file. So about 900 records are being absorbed into other records' fields, even though everything else in the parsed data looks fine.
The question is: what causes this, and how can I avoid it without splitting those records manually?
The default dialect also specifies a quote character, which can be used to escape newlines inside a field. You can override it with quotechar=None:
>>> len(open('cities5000.txt').read().splitlines())
46957
>>> len(list(csv.reader(open('cities5000.txt'), delimiter='\t')))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_csv.Error: field larger than field limit (131072)
>>> len(list(csv.reader(open('cities5000.txt'), delimiter='\t', quotechar=None)))
46957
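To see why the default quote character swallows records, here is a minimal sketch with synthetic tab-delimited data (not the actual geonames content): a `"` at the start of a field opens quoting, so the following newline becomes part of that field instead of ending the record. Passing quotechar=None (which implies QUOTE_NONE) makes every physical line its own row.

```python
import csv
import io

# Synthetic example: the second field of the first line starts with a
# double quote, so the default dialect treats the newline as data.
data = 'Foo\t"The Bar\nBaz"\tQux\n'

# Default quotechar='"': the quoted field spans the newline -> one record.
with_quoting = list(csv.reader(io.StringIO(data), delimiter='\t'))

# quotechar=None disables quoting -> every physical line is a record.
without_quoting = list(csv.reader(io.StringIO(data), delimiter='\t',
                                  quotechar=None))

print(len(with_quoting))     # 1
print(len(without_quoting))  # 2
```

The geonames file is not a quoted CSV at all, so disabling quoting is the right call: any stray `"` inside a name field would otherwise silently glue lines together.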