
Write bytes from savReader using csv.writer without date to float conversion


I want to read data from a .sav (SPSS) file and write it out as .csv for further use. For reading I use savReaderWriter.SavReader, which returns every string as a bytes object: b'string' instead of 'string'.

The following is my code in Python 3.6:

import savReaderWriter
import csv

with savReaderWriter.SavReader('input_filename.sav') as reader:
    header = reader.header
    with open('output_filename.csv','w',newline='') as output:
        w = csv.writer(output,delimiter=',')
        w.writerow(header)
        for line in reader:
            w.writerow(line)

One solution I've found is to specify ioUtf8=True in SavReader, but then all date variables are converted to floats: b'2017-09-02' becomes 13723689600.0, which datetime.fromtimestamp then reads as a date in the year 2404.
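A note on that float: SPSS stores dates as the number of seconds since 14 October 1582 (the start of the Gregorian calendar), not since the Unix epoch of 1970, which is why datetime.fromtimestamp lands in 2404. Below is a minimal sketch of converting such a value by hand with the standard library only. The helper name spss_seconds_to_datetime is made up for illustration and is not part of savReaderWriter.

```python
from datetime import datetime, timedelta

# SPSS counts seconds from the start of the Gregorian calendar.
SPSS_EPOCH = datetime(1582, 10, 14)


def spss_seconds_to_datetime(seconds):
    """Convert an SPSS date value (seconds since 1582-10-14) to a datetime.

    Hypothetical helper for illustration; not a savReaderWriter function.
    """
    return SPSS_EPOCH + timedelta(seconds=seconds)


print(spss_seconds_to_datetime(13723689600.0).date())  # 2017-09-02
```

Missing dates (NaN) would need to be filtered out before calling the helper, since timedelta cannot be built from NaN.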

Another thing that works is

w.writerow([h.decode('utf-8') for h in header])

but only for the header: the other rows also contain floats and NaNs, which have no decode method, so the same approach raises errors there.
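One way to extend the same decoding to data rows is to decode only the cells that are actually bytes and pass everything else through unchanged. This is a minimal sketch: decode_row is a hypothetical helper, and the sample row only imitates what SavReader might yield.

```python
def decode_row(row, encoding='utf-8'):
    """Decode bytes cells to str; leave floats, NaNs and other values untouched."""
    return [cell.decode(encoding) if isinstance(cell, bytes) else cell
            for cell in row]


sample = [b'2017-09-02', 1.5, float('nan')]  # made-up row for illustration
decoded = decode_row(sample)
print(decoded[:2])  # ['2017-09-02', 1.5]
```

Checking the type with isinstance avoids relying on an AttributeError to detect non-string cells.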

Specifying 'wb' instead of 'w' in open also raises an error:

TypeError: a bytes-like object is required, not 'str'

Any ideas on how to read and write this kind of data properly?


Solution

  • I found a temporary solution, although I'm not proud of it. Maybe somebody else can improve it.

    import csv

    import pandas as pd
    import savReaderWriter

    utf_errors = 0  # number of rows skipped because they could not be written
    with savReaderWriter.SavReader('input_filename.sav') as reader:
        # header names come back as bytes, so decode them to str
        header = [h.decode('utf-8') for h in reader.header]
        with open('output_filename.csv', 'w', newline='') as output:
            w = csv.writer(output, delimiter=',')
            w.writerow(header)
            for line in reader:
                row = []
                for value in line:
                    try:
                        row.append(value.decode('utf-8'))
                    except AttributeError:
                        # non-string values (floats and NaNs) have no decode()
                        row.append(value)
                try:
                    w.writerow(row)
                except UnicodeEncodeError:
                    # omit the row when a character cannot be encoded
                    utf_errors += 1

    read_output = pd.read_csv('output_filename.csv', encoding='latin1')
    

    The strange thing about the data is that, however I decode it, there are always characters that can't be handled. .decode('utf-8') loses the fewest rows (4 omitted) compared with .decode('latin1') (29 omitted), but I then have to read the file back with encoding='latin1'; otherwise I get this error:

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 9: invalid continuation byte
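
Byte 0xe9 is 'é' in latin1 and cp1252. That suggests the file was written in the platform's default encoding rather than UTF-8, because open() was called without an encoding argument. This is a guess from the error message, not something confirmed on the original data. Here is a sketch of writing and reading back with an explicit encoding on both sides; the temporary path and the row contents are made up for illustration.

```python
import csv
import os
import tempfile

rows = [['name', 'value'], ['café', 1.5]]  # made-up data with a non-ASCII character

path = os.path.join(tempfile.mkdtemp(), 'output_filename.csv')

# Write as UTF-8 explicitly instead of relying on the platform default.
with open(path, 'w', newline='', encoding='utf-8') as output:
    csv.writer(output).writerows(rows)

# Read back with the same encoding; csv.reader returns every cell as str.
with open(path, newline='', encoding='utf-8') as source:
    read_back = list(csv.reader(source))

print(read_back)  # [['name', 'value'], ['café', '1.5']]
```

If this is indeed the cause, the rows currently skipped on UnicodeEncodeError should be written as well, and the file could then be read with pd.read_csv(..., encoding='utf-8').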