Search code examples
pythonxmlunicodettyvt100

Storing VT100 escape codes in an XML file


I'm writing a Python program that logs terminal interaction (similar to the script program), and I'd like to store the log in XML format.

The problem is that the terminal interaction includes VT100 escape codes. Python doesn't complain if I write the data to a file as UTF-8 encoded, e.g.:

...
pid, fd = pty.fork()
if pid==0:
    os.execvp("bash",("bash","-l"))
else:
    # Lots of TTY-related stuff here
    # see http://groups.google.com/group/comp.lang.python/msg/de40b36c6f0c53cc
    fout = codecs.open("session.xml", encoding="utf-8", mode="w")
    fout.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    fout.write("<session>\n")
    ...
    r, w, e = select.select([0, fd], [], [], 1)
    for f in r:
        if f==fd:
            fout.write("<entry><![CDATA[")
            buf = os.read(fd, 1024)
            fout.write(buf)
            fout.write("]]></entry>\n")
        else:
            ....
    fout.write("</session>")
    fout.close()

This script "works" in the sense that it writes a file to disk, but the resulting file is not proper utf-8, which causes XML parsers like etree to barf on the escape codes.

One way to deal with this is to filter out the escape codes first. But if is it possible to do something like this where the escape codes are maintained and the resulting file can be parsed by XML tools like etree?


Solution

  • Your problem is not that the control codes aren't proper UTF-8, they are, it's just ASCII ESC and friends are not proper XML characters, even inside a CDATA section.

    The only valid XML characters in XML 1.0 which have values less than U+0020 are U+0009 (tab), U+000A (newline) amd U+000D (carriage return). If you want to record things involving other codes such as escape (U+001B) then you will have to escape them in some way. There is no other option.