Tags: python, python-2.7, unicode, lxml, spss

How to handle special characters when accessing labels from SPSS in Python?


I access an SPSS dataset through Python and extract all the information I need from it: variable names, variable labels, value labels, etc.

But when I get to the labels, I cannot get them out as UTF-8 (the dataset itself is UTF-8: the labels are supported and displayed correctly when I open the data in SPSS).

The main issue for me is that, after collecting all the information, I want to write an XML file (with lxml), but this raises a UnicodeEncodeError.

I do not know how to get the correct labels into the XML in the end. Calling .encode('utf-8') does not change anything, so I am at a loss as to whether there is a way to do it.

My code:

import spss,spssaux
from datetime import date
mysyntax=r"""GET FILE="C:\Users\file.sav"."""
spss.Submit(mysyntax)

today_date=date.today().strftime('%Y-%m-%d')
vardict = spssaux.VariableDict()
var_list = []
var_labels = []
var_values = {}

spss.StartDataStep()
datasetObj = spss.Dataset()

index = 0
for var in datasetObj.varlist:
    varObj = datasetObj.varlist[index]
    var_list.append(varObj.name)
    var_labels.append(str(varObj.label)).encode('utf-8')
    var_values[var.name] = str(var.valueLabels).encode('utf-8')
    index += 1

spss.EndDataStep()
spss.StopSPSS()

EDIT:

Here is the result of my code that reads the SPSS file, i.e. the lists I created:

My code for LXML:

import lxml.etree
import lxml.builder    

new_xml = lxml.builder.ElementMaker()
date = new_xml.date
survey = new_xml.survey
record = new_xml.record
variable = new_xml.variable
name = new_xml.name
label = new_xml.label
nvars = len(var_list)
for i in range(nvars):
    final_xml =(date(survey(record(
                    *[variable(
                            name(str(var_list[i])),
                            label(str(var_labels[i])),
                            ident = str(i+1)) for i in range(nvars)],
                    ident = 'A'))),
                today_date)

newxml = lxml.etree.tostring(final_xml, xml_declaration=True, encoding='utf-8', pretty_print=True)

It throws this error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xea' in position 9: ordinal not in range(128)

I have also tried to encode the labels after reading them, as follows:

for labels in var_labels:
    labels = labels.encode('utf-8')

The labels look fine in Python, but when I get to the lxml part, I still have the same problem.

Here is a reproducible example:

from datetime import date

var_labels = [u'À quelle fréquence?', u'Comment ça se passe?']
var_list = ['Q1', 'Q2']
today_date=date.today().strftime('%Y-%m-%d')


import lxml.etree
import lxml.builder    

new_xml = lxml.builder.ElementMaker()
date = new_xml.date
survey = new_xml.survey
record = new_xml.record
variable = new_xml.variable
name = new_xml.name
label = new_xml.label
nvars = len(var_list)
for i in range(nvars):
    final_xml =(date(survey(record(
                    *[variable(
                            name(str(var_list[i])),
                            label(str(var_labels[i])),
                            ident = str(i+1)) for i in range(nvars)],
                    ident = 'A'))),
                today_date)

newxml = lxml.etree.tostring(final_xml, xml_declaration=True, encoding='utf-8', pretty_print=True)

Solution

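  • In Python 2, str(unicode_string) implicitly encodes with the default codec, which is ASCII, so it is the same as unicode_string.encode("ascii"). There are non-ASCII characters in your labels (u'\xea' is "ê"), which is why the call fails.

    The error should go away if you change str(var_labels[i]) to var_labels[i]. lxml accepts unicode strings as element text and encodes them itself when you serialize with tostring(..., encoding='utf-8'), so there is no need to encode the labels beforehand. Note also that the loop "for labels in var_labels: labels = labels.encode('utf-8')" only rebinds the loop variable and leaves the list unchanged.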
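
    To illustrate the point, here is a minimal sketch rather than a drop-in fix. It assumes the labels are already unicode objects, as in the reproducible example, and it uses the standard library's xml.etree.ElementTree as a stand-in for lxml so that it runs without extra packages. The element names mirror the question; the variables ascii_failed, record and xml_bytes are made up for the illustration.

```python
# -*- coding: utf-8 -*-
# Stand-in for lxml: the standard-library ElementTree, used here only so the
# sketch runs without third-party packages.
import xml.etree.ElementTree as ET

var_list = ['Q1', 'Q2']
var_labels = [u'À quelle fréquence?', u'Comment ça se passe?']

# Python 2's str() on a unicode object performs this ASCII encoding
# implicitly, so it fails on the first non-ASCII character.
try:
    var_labels[0].encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Build the tree from the unicode labels directly, without str() or a
# manual .encode() call.
record = ET.Element('record', ident='A')
for i, (var_name, var_label) in enumerate(zip(var_list, var_labels), 1):
    variable = ET.SubElement(record, 'variable', ident=str(i))
    ET.SubElement(variable, 'name').text = var_name
    ET.SubElement(variable, 'label').text = var_label

# Encoding happens once, at serialization time.
xml_bytes = ET.tostring(record, encoding='utf-8')
```

    The same idea should carry over to the ElementMaker code in the question: pass var_labels[i] as it is and let tostring(..., encoding='utf-8') do the encoding.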