I'm trying to convert SPSS syntax files to readable HTML. It works almost perfectly, except that a single non-printable character is inserted into the HTML file. It doesn't seem to have an ASCII code, it looks like a tiny dot, and it's causing trouble.
It occurs only in the second line of the HTML file, which always corresponds to the first line of the original file. That probably hints at which line(s) of Python cause the problem (please see the comments).
The code that seems to cause this is:
rfil = open(fil,"r") #rfil = Read File, original syntax
wfil = open(txtFil,"w") #wfil = Write File, HTML output
#Line below causes problem??
wfil.write("<ol class='code'>\n<li>")
cnt = 0
for line in rfil:
    if cnt == 0:
        #Line below causes problem??
        wfil.write(line.rstrip("\n").replace("'",'&#39;').replace('"','&quot;'))
    elif len(line) > 1:
        wfil.write("</li>\n<li>" + line.strip("\n").replace("'",'&#39;').replace('"','&quot;'))
    else:
        wfil.write("<br /><br />")
    cnt += 1
wfil.write("</li>\n</ol>")
wfil.close()
rfil.close()
Screen shot of the result
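The input file seems to begin with a byte order mark (BOM), which some editors write to indicate UTF-8 encoding. In UTF-8 the BOM is the three bytes EF BB BF (the character U+FEFF); when the file is read without decoding, those bytes are copied into the HTML along with the first line, which would explain the stray character. You can decode the file to Unicode strings by opening it with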
import codecs
rfil = codecs.open(fil, "r", "utf_8_sig")
The utf_8_sig encoding skips the BOM at the beginning of the file, if one is present.
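To see the difference between the two codecs without touching any files, here is a small sketch that decodes the same bytes both ways. The sample SPSS line is made up for illustration; only `codecs.BOM_UTF8` and the codec names come from the standard library.

```python
import codecs

# Simulate the raw bytes of a file that starts with a UTF-8 BOM.
# The SPSS command below is just a made-up sample line.
raw = codecs.BOM_UTF8 + u"GET FILE='data.sav'.\n".encode("utf_8")

# Plain utf_8 keeps the BOM as the character U+FEFF at position 0.
with_bom = raw.decode("utf_8")

# utf_8_sig drops a leading BOM while decoding.
without_bom = raw.decode("utf_8_sig")

print(repr(with_bom[0]))     # the invisible BOM character
print(repr(without_bom[0]))  # the first real character of the line
```

If the first character printed is U+FEFF, the input does carry a BOM and that is what ends up in the HTML.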
Some programs recognize the BOM, some don't. To write the file out without a BOM, use
wfil = codecs.open(txtFil, "w", "utf_8")
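Putting the two pieces together, the conversion loop from the question could look roughly like the sketch below. It is not a drop-in replacement: the function names `escape_quotes` and `syntax_to_html` are hypothetical, `io.open` with an `encoding` argument is used in place of `codecs.open` (both decode the same way), and the `&#39;` / `&quot;` replacements are assumed to be what the original code intended.

```python
import io


def escape_quotes(text):
    # Assumed intent of the original replace() calls: turn quotes into entities.
    return text.replace("'", "&#39;").replace('"', "&quot;")


def syntax_to_html(src_path, dst_path):
    """Write the lines of src_path as an HTML ordered list to dst_path."""
    # utf_8_sig drops a leading BOM on read; utf_8 writes none on output.
    with io.open(src_path, "r", encoding="utf_8_sig") as rfil, \
            io.open(dst_path, "w", encoding="utf_8") as wfil:
        wfil.write(u"<ol class='code'>\n<li>")
        for cnt, line in enumerate(rfil):
            line = line.rstrip(u"\n")
            if cnt == 0:
                # First line continues the <li> opened above.
                wfil.write(escape_quotes(line))
            elif line:
                wfil.write(u"</li>\n<li>" + escape_quotes(line))
            else:
                # Blank source line becomes vertical space.
                wfil.write(u"<br /><br />")
        wfil.write(u"</li>\n</ol>")
```

The `with` block closes both files even if an error occurs, and `enumerate` replaces the manual `cnt` counter. Note that blank lines are detected after stripping the newline, which differs slightly from the `len(line) > 1` test in the question.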