python csv unicode latin1

Converting ISO-8859-1 to utf-8 (øæå)


I have a text file containing the letters 'øæå', and I want this script to recognize these letters and write them correctly to the CSV file.

with codecs.open('transaksjonliste.txt', 'r', 'ISO-8859-1') as file:
    for line in file:

        line = file.readline() 
        lineS = line.encode('ISO-8859-1', 'ignore').decode('utf-8')
        splitTab = lineS.split(';')

        for s in splitTab:
            newS = s[1:-1]

        date = splitTab[0].replace('.', '/')
        insertList = [date,]
        out.writerow(date)

Gives:

  File "Q:\DropBox\Development\Scripts\tes2.py", line 17, in <module>
    lineS = line.encode('ISO-8859-1', 'ignore').decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 14: invalid start byte

Solution

  • with codecs.open('transaksjonliste.txt', 'r', 'ISO-8859-1') as file:
        for line in file:
    
            line = file.readline() 
            lineS = line.encode('ISO-8859-1', 'ignore').decode('utf-8')
            splitTab = lineS.split(';')
    

    Remove line = file.readline(). You are already iterating over (reading) the lines with the for line in file construct, so calling readline() inside the loop consumes an extra line on every iteration.

    lineS = line.encode('ISO-8859-1', 'ignore').decode('utf-8')
    

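    is not what you want: it encodes the text to ISO-8859-1 and then tries to decode those ISO-8859-1 bytes as if they were UTF-8. That fails because 'ø' is the single byte 0xf8 in ISO-8859-1, which is not a valid UTF-8 start byte. If you had a byte string in ISO-8859-1 and wanted to convert it to UTF-8, you would normally do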

     lineS = line.decode('ISO-8859-1', 'ignore').encode('utf-8')
    

    However, you have already converted the data from ISO-8859-1 to unicode in the codecs.open() call, so you just need to do

     lineS = line.encode('utf-8')
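
Putting the points above together, here is a minimal sketch of the whole conversion, assuming Python 3 (which the lowercase 'utf-8' in the traceback suggests). The function name latin1_to_utf8_csv and its two path parameters are hypothetical, not from the question. It lets open() do the decoding on read and the encoding on write, so no manual encode()/decode() calls are needed, and it uses csv.reader to strip the surrounding quotes rather than slicing with s[1:-1]. That assumes the fields are ordinary double-quoted CSV fields.

```python
import csv


def latin1_to_utf8_csv(src_path, dst_path):
    """Read a ';'-separated ISO-8859-1 text file and write it as UTF-8 CSV.

    Hypothetical helper for illustration; the names are not from the question.
    """
    # open() decodes ISO-8859-1 bytes to str when reading and encodes str
    # to UTF-8 when writing. newline="" is what the csv module recommends.
    with open(src_path, "r", encoding="ISO-8859-1", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        reader = csv.reader(src, delimiter=";")
        writer = csv.writer(dst)
        for row in reader:  # the loop itself reads each line; no readline()
            if not row:
                continue  # skip blank lines
            # As in the question: dots in the date column become slashes.
            row[0] = row[0].replace(".", "/")
            writer.writerow(row)


# Why the original code failed: 'ø' is the single byte 0xf8 in ISO-8859-1,
# and that byte cannot start a UTF-8 sequence.
assert "ø".encode("ISO-8859-1") == b"\xf8"
```

You would call it as latin1_to_utf8_csv('transaksjonliste.txt', 'out.csv'), where 'out.csv' is a placeholder output name. Note that writerow() expects a sequence of fields; passing a bare string such as date, as the question does, writes each character as its own column.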