I am accessing an SPSS dataset through Python and extracting everything I need from it: variable names, variable labels, value labels, etc.
When I get to the labels, though, I cannot get them as UTF-8, even though the dataset itself uses it (the labels are displayed correctly when I open the data in SPSS).
The main issue is that, once I have all the information, I want to write an XML file with lxml, and this raises a UnicodeEncodeError.
I do not know how to get the correct labels into the XML in the end. Calling .encode('utf-8') does not change anything, so I am at a loss. Is there a way to do this?
My code:
import spss,spssaux
from datetime import date
mysyntax=r"""GET FILE="C:\Users\file.sav"."""
spss.Submit(mysyntax)
today_date=date.today().strftime('%Y-%m-%d')
vardict = spssaux.VariableDict()
var_list = []
var_labels = []
var_values = {}
spss.StartDataStep()
datasetObj = spss.Dataset()
index = 0
for var in datasetObj.varlist:
    varObj = datasetObj.varlist[index]
    var_list.append(varObj.name)
    var_labels.append(str(varObj.label).encode('utf-8'))
    var_values[var.name] = str(var.valueLabels).encode('utf-8')
    index += 1
spss.EndDataStep()
spss.StopSPSS()
EDIT:
Here is the result of my code that reads the SPSS file, i.e. the lists I created:
My code for LXML:
import lxml.etree
import lxml.builder
new_xml = lxml.builder.ElementMaker()
date = new_xml.date
survey = new_xml.survey
record = new_xml.record
variable = new_xml.variable
name = new_xml.name
label = new_xml.label
nvars = len(var_list)
for i in range(nvars):
    final_xml =(date(survey(record(
        *[variable(
            name(str(var_list[i])),
            label(str(var_labels[i])),
            ident = str(i+1)) for i in range(nvars)],
        ident = 'A'))),
        today_date)
newxml = lxml.etree.tostring(final_xml, xml_declaration=True, encoding='utf-8', pretty_print=True)
It throws this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xea' in position 9: ordinal not in range(128)
I have also tried encoding the labels after retrieving them, as follows:
for labels in var_labels:
    labels = labels.encode('utf-8')
They look fine in Python, but when I get to the lxml part I still have the same problem.
Here is a reproducible example:
from datetime import date
var_labels = [u'À quelle fréquence?', u'Comment ça se passe?']
var_list = ['Q1', 'Q2']
today_date=date.today().strftime('%Y-%m-%d')
import lxml.etree
import lxml.builder
new_xml = lxml.builder.ElementMaker()
date = new_xml.date
survey = new_xml.survey
record = new_xml.record
variable = new_xml.variable
name = new_xml.name
label = new_xml.label
nvars = len(var_list)
for i in range(nvars):
    final_xml =(date(survey(record(
        *[variable(
            name(str(var_list[i])),
            label(str(var_labels[i])),
            ident = str(i+1)) for i in range(nvars)],
        ident = 'A'))),
        today_date)
newxml = lxml.etree.tostring(final_xml, xml_declaration=True, encoding='utf-8', pretty_print=True)
In Python 2, str(unicode_string) is the same as unicode_string.encode("ascii") (ASCII being the default encoding). There are non-ASCII characters in your labels, so that implicit conversion fails.
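To see the failure in isolation, here is a minimal sketch that encodes one of the labels from the question with both codecs. The helper name `try_encode` is made up for illustration, and the explicit `.encode('ascii')` call stands in for what `str()` does implicitly in Python 2.

```python
# -*- coding: utf-8 -*-
label = u'À quelle fréquence?'


def try_encode(text, codec):
    """Return the encoded bytes, or the exception's class name on failure."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError as exc:
        return type(exc).__name__


# ASCII cannot represent 'À' or 'é', so this is the step that blows up.
ascii_result = try_encode(label, 'ascii')

# UTF-8 can represent every character in the label.
utf8_result = try_encode(label, 'utf-8')

print(ascii_result)
print(repr(utf8_result))
```

The first result is the name of the exception from the traceback in the question; the second is an ordinary byte string that decodes back to the original label.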
The error should go away if you change str(var_labels[i]) to var_labels[i].
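As a rough sketch of that change, the snippet below builds a trimmed-down version of the document from the reproducible example and hands the unicode labels to the tree untouched. Some assumptions on my side: it uses the plain `Element`/`SubElement` API rather than `lxml.builder` so that it can fall back to the standard library when lxml is not installed, and it leaves out the outer `date` element to stay short.

```python
# -*- coding: utf-8 -*-
try:
    from lxml import etree  # the library used in the question
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback (assumption)

var_list = ['Q1', 'Q2']
var_labels = [u'À quelle fréquence?', u'Comment ça se passe?']

survey = etree.Element('survey')
record = etree.SubElement(survey, 'record', ident='A')

for i, (var_name, var_label) in enumerate(zip(var_list, var_labels), start=1):
    variable = etree.SubElement(record, 'variable', ident=str(i))
    etree.SubElement(variable, 'name').text = var_name
    # Keep the label as unicode: no str() and no manual .encode() here.
    etree.SubElement(variable, 'label').text = var_label

# The text is encoded once, at serialization time; the result is bytes.
newxml = etree.tostring(survey, encoding='utf-8')
```

The same idea should carry over to the `ElementMaker` calls: pass `var_labels[i]` directly and let `tostring(..., encoding='utf-8')` do the encoding. Separately, the loop `for labels in var_labels: labels = labels.encode('utf-8')` only rebinds the loop variable and leaves the list itself unchanged, which would explain why it seemed to have no effect.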