I am trying to read and reformat a very large (2 GB+) .out file that is structured like a CSV. I previously used the built-in open() without hitting this error, but switched to codecs.open() because open() was having trouble with some characters.
It is now throwing:

Traceback (most recent call last):
  line 21, in <module>
    if(r[5]==""):
IndexError: list index out of range

on the first row, although there is definitely an element at r[5].
(runtime is 0.301s)

import sys
import csv
import datetime
import codecs

maxInt=sys.maxsize
decrement=True
while decrement:
    decrement=False
    try:
        csv.field_size_limit(maxInt)
    except OverflowError:
        maxInt = int(maxInt/10)
        decrement = True

with codecs.open("file.out", 'rU', 'utf-16-be') as source:
    rdr = csv.reader(source)
    with open("out.csv","w", newline='') as result:
        wtr = csv.writer(result)
        wtr.writerow(("Column1", "column2", "column3", "etc..."))
        for r in rdr:
            if(r[5]==""):
                continue
            wtr.writerow((datetime.datetime.strptime(r[5], '%m/%d/%Y').strftime('%Y-%m-%d'), r[3], r[7], r[9]+r[10]+" "+r[12]))

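One way to sanity-check the IndexError on the first row is to look at what the reader actually yields under a given encoding. The sketch below is only an illustration under an assumption the post does not confirm: that the file really holds one byte per character. In that case 'utf-16-be' pairs the bytes up into unrelated characters, the commas disappear, and each row comes back as a single field. The helper name first_row is hypothetical, and the bytes are the first six fields of a sample row.

```python
import csv
import io

def first_row(raw, encoding):
    """Decode raw bytes with the given encoding and return the first CSV row."""
    text = raw.decode(encoding, errors="replace")
    return next(csv.reader(io.StringIO(text, newline="")))

# First six fields of a sample row, stored as plain single-byte characters.
raw = b'"A00017","K","G","1999","4530","01/12/1999"\n'

wrong_row = first_row(raw, "utf-16-be")  # byte pairs become unrelated characters
right_row = first_row(raw, "ascii")      # one byte per character

print(len(wrong_row), len(right_row))
print(right_row[5])
```

If the real file behaves the same way, printing len(r) and repr(r) for the first row in the original loop should show a single long field rather than thirty short ones.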
Using utf-8 throws UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 12: invalid continuation byte
Using latin-1 or ISO-8859-1 throws UnicodeEncodeError: 'charmap' codec can't encode characters in position 57-58: character maps to <undefined>, although only after running for much longer.
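A UnicodeEncodeError is raised while encoding, so with latin-1 the failure is likely happening when rows are written to out.csv, which is opened without an explicit encoding and therefore uses the platform default. The following is a minimal sketch of passing an encoding on both sides. It assumes latin-1 is an acceptable guess for the input (byte 0xc9 is É in latin-1) and that UTF-8 output is wanted; neither is confirmed by the post, and copy_rows and the temporary file names are hypothetical.

```python
import csv
import os
import tempfile

def copy_rows(src_path, dst_path, src_encoding="latin-1", dst_encoding="utf-8"):
    """Copy CSV rows from src_path to dst_path using explicit encodings."""
    count = 0
    with open(src_path, "r", encoding=src_encoding, newline="") as source, \
         open(dst_path, "w", encoding=dst_encoding, newline="") as result:
        wtr = csv.writer(result)
        for r in csv.reader(source):
            wtr.writerow(r)
            count += 1
    return count

# Small demonstration file containing the byte 0xc9 from the error message.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "sample.out")
dst = os.path.join(tmp_dir, "sample.csv")
with open(src, "wb") as f:
    f.write(b'"A00017","\xc9LECT"\r\n')

rows_copied = copy_rows(src, dst)
with open(dst, "r", encoding="utf-8", newline="") as f:
    copied_text = f.read()
print(rows_copied, repr(copied_text))
```

If the data was actually produced as cp1252 or some other code page, only the src_encoding argument would need to change.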
The input file looks like this:
"A00017","K","G","1999","4530","01/12/1999","","","","PEOPLE TO ELECT MANGINELLI","","","","258 MAGNIOLIA DRIVE","SELDEN","NY","11784","","","404.57","","","","","","","2","","NAA","07/22/1999 08:43:59"
"A00037","K","G","1999","999999","01/12/1999","","","","CITIZENS TO ELECT TEDISCO TO ASSEMBLY","","","","","","","","","","0","","","","","","","2","","",""
"A00037","K","N","1999","1693","01/15/1999","","","","OUTSTANDING LOAN","","","","2176 GUILDERLAND AVE","SCHENECTADY","NY","12306","","","10474.8","10474.8","","","OTHER","","PREVIOUS LOAN FROM JAMES TEDISCO","","P","JM","07/15/1999 15:08:17"
"A00037","J","N","2000","1694","01/13/2000","","","","OUTSTANDING LOAN","","","","2176 GUILDERLAND","SCHENECTADY","NY","12306","","","10474.8","10474.8","","","OTHER","","LOANS FROM PREVIOUS CAMPAIGNS FROM J","","P","JM","01/14/1900 16:35:09"
"A00037","K","X","2000","999999","","","","","","","","","","","","","","","","","","","","","","","","","07/20/2000 00:00:00"
"A00037","J","X","2001","999999","","","","","","","","","","","","","","","","","","","","","","","","","01/17/2001 00:00:00"
"A00037","K","X","2002","999999","","","","","","","","","","","","","","","","","","","","","","","","","07/19/2002 00:00:00"
"A00037","J","X","2003","999999","","","","","","","","","","","","","","","","","","","","","","","","","01/21/2003 00:00:00"
"A00037","K","X","2003","999999","","","","","","","","","","","","","","","","","","","","","","","","","07/16/2003 00:00:00"
"A00037","J","X","2004","999999","","","","","","","","","","","","","","","","","","","","","","","","","01/22/2004 00:00:00"
I've gotten this far thanks to:
In the 'file.out' you are reading from, find the character that separates the fields in each row, such as a tab ('\t') or a comma (','), and pass it as the 'delimiter' argument.
Try printing 'r' to see the character between the column names or the values in a row:
rdr = csv.reader(source,delimiter=<separator>)
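For the sample rows shown earlier, the separator appears to be a comma with double-quoted fields, which are also the csv module's defaults. A small sketch of the suggestion above, run on the first ten fields of one sample row held in memory (the shortening is only for brevity):

```python
import csv
import datetime
import io

# First ten fields of one sample row from the input file.
sample = ('"A00037","K","G","1999","999999","01/12/1999","","","",'
          '"CITIZENS TO ELECT TEDISCO TO ASSEMBLY"\r\n')

rdr = csv.reader(io.StringIO(sample, newline=""), delimiter=",", quotechar='"')
r = next(rdr)

# Same date conversion as in the script above.
iso_date = datetime.datetime.strptime(r[5], "%m/%d/%Y").strftime("%Y-%m-%d")
print(len(r), repr(r[5]), iso_date)
```

When the text is decoded correctly, r[5] holds the date field here, so a wrong delimiter alone would not obviously explain the error for this sample.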