I want to read data from a .sav (SPSS) file and write it out as .csv for further use. For reading I use savReaderWriter.SavReader, and it returns all strings as bytes: b'string' instead of 'string'.
This is my code, in Python 3.6:
import savReaderWriter
import csv

with savReaderWriter.SavReader('input_filename.sav') as reader:
    header = reader.header
    with open('output_filename.csv', 'w', newline='') as output:
        w = csv.writer(output, delimiter=',')
        w.writerow(header)
        for line in reader:
            w.writerow(line)
One solution I've found is to specify ioUtf8=True in SavReader, but then all date variables are converted to floats: b'2017-09-02' becomes 13723689600.0, which datetime.fromtimestamp then reads as a date in the year 2404.
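The float is probably not a Unix timestamp. SPSS stores dates as the number of seconds since midnight on 14 October 1582, so datetime.fromtimestamp, which counts from 1970, lands a few centuries too late. A minimal sketch of the conversion, assuming the value really is an SPSS date in seconds (spss_to_date is a hypothetical helper name, not part of savReaderWriter):

```python
from datetime import datetime, timedelta

# SPSS counts seconds from the start of the Gregorian calendar.
SPSS_EPOCH = datetime(1582, 10, 14)

def spss_to_date(seconds):
    """Convert an SPSS date value (seconds since 1582-10-14) to a date."""
    return (SPSS_EPOCH + timedelta(seconds=seconds)).date()

print(spss_to_date(13723689600.0))  # 2017-09-02
```

Missing values (NaN) would raise an error in timedelta, so they need to be filtered out before calling the helper.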
Another thing that works is
w.writerow([h.decode('utf-8') for h in header])
but only for the header: the other rows contain floats and NaNs, which have no decode method and so raise an AttributeError.
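One way to extend the same idea to the data rows is to decode only the values that are actually bytes and pass everything else through. This is a sketch, assuming the strings are UTF-8 encoded; decode_cell is a hypothetical helper name and the sample row is made up:

```python
def decode_cell(value, encoding='utf-8'):
    """Return str for bytes values; pass everything else through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# Made-up row mixing a bytes string, a float and a NaN
row = [b'2017-09-02', 3.5, float('nan')]
decoded = [decode_cell(v) for v in row]
print(decoded[:2])  # ['2017-09-02', 3.5]
```

Inside the reading loop this would become w.writerow([decode_cell(v) for v in line]).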
Specifying 'wb' instead of 'w' in open also returns an error:
TypeError: a bytes-like object is required, not 'str'
Any ideas of how to read and write this kind of data properly?
I found a temporary solution, although I'm not proud of it. Maybe somebody else can improve it.
import csv

import pandas as pd
import savReaderWriter

utf_errors = 0  # number of rows skipped because they could not be written

with savReaderWriter.SavReader('input_filename.sav') as reader:
    # Header names come back as bytes, so decode them to str
    header = [h.decode('utf-8') for h in reader.header]
    with open('output_filename.csv', 'w', newline='') as output:
        w = csv.writer(output, delimiter=',')
        w.writerow(header)
        for line in reader:
            row = []
            for value in line:
                try:
                    row.append(value.decode('utf-8'))
                except AttributeError:
                    # non-string values (floats and NaNs) have no decode()
                    row.append(value)
            try:
                w.writerow(row)
            except UnicodeEncodeError:
                # omit the row when an unknown character is found
                utf_errors += 1

# Read the result back once the file has been closed
read_output = pd.read_csv('output_filename.csv', encoding='latin1')
The strange thing about the data is that however I decode it, there are always symbols that can't be read. I found .decode('utf-8') the most effective (it omits 4 lines) compared to .decode('latin1') (which omits 29 lines), but then I have to read the file with encoding='latin1', otherwise I get this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 9: invalid continuation byte
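Byte 0xe9 is 'é' in Latin-1, which suggests the CSV was written in a single-byte encoding rather than UTF-8. One plausible cause, assuming open() is falling back to the platform default encoding (for example cp1252 on Windows), is that the writer and the reader simply disagree about the encoding. A sketch that passes the same encoding to both sides, using a temporary file and made-up sample data:

```python
import csv
import os
import tempfile

print(b'\xe9'.decode('latin1'))  # é

rows = [['name', 'value'], ['café', 1.5]]  # made-up sample data
path = os.path.join(tempfile.mkdtemp(), 'output_filename.csv')

# Write with an explicit encoding instead of the platform default
with open(path, 'w', newline='', encoding='utf-8') as output:
    csv.writer(output).writerows(rows)

# Read back with the same encoding; csv returns every field as str
with open(path, newline='', encoding='utf-8') as f:
    read_back = list(csv.reader(f))

print(read_back)  # [['name', 'value'], ['café', '1.5']]
```

pd.read_csv takes the same encoding argument, so a file written this way should load with encoding='utf-8'. If the skipped rows were caused by the default encoding, writing as UTF-8 may also make the UnicodeEncodeError handling unnecessary.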