IndexError: list index out of range is thrown now that I've changed the way the file is read

I am trying to read and reformat a very large (2GB+) .out file that is structured like a CSV. I had previously used the standard open() with no such issue, but I changed it to codecs.open() because open() was having trouble with some characters.

It is now throwing

Traceback (most recent call last):
  line 21, in <module>
    if(r[5]==""):
IndexError: list index out of range

on the first row, although there is definitely an element at r[5]. (Runtime is 0.301s.)

import sys
import csv
import datetime
import codecs
maxInt=sys.maxsize
decrement=True

while decrement:
    decrement=False
    try:
        csv.field_size_limit(maxInt)
    except OverflowError:
        maxInt = int(maxInt/10)
        decrement = True

with codecs.open("file.out", 'rU', 'utf-16-be') as source:
    rdr = csv.reader(source)
    with open("out.csv","w", newline='') as result:
        wtr = csv.writer(result)
        wtr.writerow(("Column1", "column2", "column3", "etc..."))
        for r in rdr:
            if(r[5]==""):
                continue
            wtr.writerow((datetime.datetime.strptime(r[5], '%m/%d/%Y').strftime('%Y-%m-%d'), r[3], r[7], r[9]+r[10]+" "+r[12]))

Using utf-8 throws UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 12: invalid continuation byte

Using latin-1 or ISO-8859-1 throws UnicodeEncodeError: 'charmap' codec can't encode characters in position 57-58: character maps to <undefined>, though only after running for much longer.

The input file looks like this:

"A00017","K","G","1999","4530","01/12/1999","","","","PEOPLE TO ELECT MANGINELLI","","","","258 MAGNIOLIA DRIVE","SELDEN","NY","11784","","","404.57","","","","","","","2","","NAA","07/22/1999 08:43:59"
"A00037","K","G","1999","999999","01/12/1999","","","","CITIZENS TO ELECT TEDISCO TO ASSEMBLY","","","","","","","","","","0","","","","","","","2","","",""
"A00037","K","N","1999","1693","01/15/1999","","","","OUTSTANDING LOAN","","","","2176 GUILDERLAND AVE","SCHENECTADY","NY","12306","","","10474.8","10474.8","","","OTHER","","PREVIOUS LOAN FROM JAMES TEDISCO","","P","JM","07/15/1999 15:08:17"
"A00037","J","N","2000","1694","01/13/2000","","","","OUTSTANDING LOAN","","","","2176 GUILDERLAND","SCHENECTADY","NY","12306","","","10474.8","10474.8","","","OTHER","","LOANS FROM PREVIOUS CAMPAIGNS FROM J","","P","JM","01/14/1900 16:35:09"
"A00037","K","X","2000","999999","","","","","","","","","","","","","","","","","","","","","","","","","07/20/2000 00:00:00"
"A00037","J","X","2001","999999","","","","","","","","","","","","","","","","","","","","","","","","","01/17/2001 00:00:00"
"A00037","K","X","2002","999999","","","","","","","","","","","","","","","","","","","","","","","","","07/19/2002 00:00:00"
"A00037","J","X","2003","999999","","","","","","","","","","","","","","","","","","","","","","","","","01/21/2003 00:00:00"
"A00037","K","X","2003","999999","","","","","","","","","","","","","","","","","","","","","","","","","07/16/2003 00:00:00"
"A00037","J","X","2004","999999","","","","","","","","","","","","","","","","","","","","","","","","","01/22/2004 00:00:00"

I've gotten this far thanks to:

"Line contains NULL byte" in CSV reader (Python)

_csv.Error: field larger than field limit (131072)

Solution

In the 'file.out' you are reading from, find the character that separates the fields in each row, such as a tab ('\t') or a comma (','), and pass it to the 'delimiter' parameter of csv.reader.

Try printing 'r' to see which character sits between the column names or the values in a row. If the reader is splitting on the wrong character, the row comes back with fewer elements than expected, so r[5] does not exist.

rdr = csv.reader(source, delimiter=<separator>)
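As a rough sketch of the check described above, the snippet below guesses the separator from a small sample of decoded text and then looks at the length of the first parsed row before indexing into it. The helper names guess_delimiter and first_row are made up for illustration, the candidate separator list is an assumption, and the sample string stands in for the first lines of the real file (shortened to six fields). csv.Sniffer only guesses, so it is worth confirming the result by eye.

```python
import csv
import io


def guess_delimiter(sample, candidates=",\t;|"):
    """Guess the field separator from a chunk of decoded text.

    Hypothetical helper; `candidates` is an assumed list of likely separators.
    """
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter


def first_row(text_stream, delimiter):
    """Return the first parsed row (or [] if the stream is empty)."""
    return next(csv.reader(text_stream, delimiter=delimiter), [])


# Stand-in for the first lines of the decoded file, cut down to six fields.
sample = (
    '"A00017","K","G","1999","4530","01/12/1999"\n'
    '"A00037","K","G","1999","999999","01/12/1999"\n'
)

sep = guess_delimiter(sample)
row = first_row(io.StringIO(sample), sep)

# Check the row length before touching r[5]; a length of 1 usually means
# the reader did not split on the separator the file actually uses.
print(repr(sep), len(row), row)
```

With the real file, the sample could be taken by reading the first few thousand characters from the opened source and then reopening it (or seeking back to the start) before passing it to csv.reader with the detected delimiter.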