
ExpatError: not well-formed (invalid token) when using SimpleXMLRPCServer caused by diacritic characters


It took me a long while to pinpoint the specific cause of this bug. I am writing a simple XML-RPC server that allows directory listing and possibly other read-only operations. I have already written a simple method that lists all folders and files and represents them as a dictionary:

def list_dir(self, dirname):
    """Returns list of files and directories as a dictionary, where key is name and values is either 'file' or 'dir'"""
    dirname = os.path.abspath(os.path.join(self.server.cwd,dirname))
    #Check that the path doesn't lead above the root working directory
    if dirname.find(self.server.cwd)==-1:
        raise SecurityError("There was an attempt to browse in %s which is above the root working directory %s."%(dirname, self.server.cwd))
    check_for_valid_directory(dirname)
    #Looping through directory
    files = [i for i in os.listdir(dirname)]
    #Create associative array
    result = {}
    #Iterate through files
    for f in files:
        fullpath = os.path.join(dirname, f)
        #Appending directories
        if os.path.isdir(fullpath):
            result[f] = "dir"
        else:
            result[f] = "file" 

    print "Sending data", result   
    return result

Now, when the directory contains a file (or rather a folder) named Nová složka, the client receives an error instead of the desired list. When I removed the problematic filename, I received the data without errors. I don't think the Python library has this right: either the argument conversion should be complete, including any Unicode handling, or not present at all.

But anyway, how should I encode the data that the Python library can't handle?


Solution

  • You have to make sure the filenames and paths are unicode objects and that all filenames use the correct encoding. The last part may be a bit tricky, as POSIX filenames are byte strings and there is no requirement that all filenames on a partition be encoded with the same encoding. In that case there is not much you can do other than decode the names yourself and handle errors somehow, or return the filenames as binary data instead of (unicode) strings.

    The filename-related functions in os and os.path return unicode strings if they get unicode strings as arguments. So if you make sure that dirname is of type unicode instead of str, then os.listdir() will return unicode strings, which should then be transmittable via XML-RPC.
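
A rough sketch of the advice above, not a definitive implementation. The helper names to_text, list_dir_unicode and xml_roundtrip are made up for illustration, and the UTF-8 default is only an assumption about how the names on disk are encoded. The code is meant to run on both Python 2 and Python 3; on Python 3 os.listdir() already returns text when given a text argument, so the byte-string branch mostly matters on Python 2.

```python
import os

try:
    import xmlrpclib  # Python 2
except ImportError:
    import xmlrpc.client as xmlrpclib  # Python 3


def to_text(name, encoding="utf-8"):
    """Return name as a unicode string, decoding byte strings if needed.

    Assumption: byte-string names are encoded with `encoding`. Bytes that
    cannot be decoded are replaced with U+FFFD instead of raising.
    """
    if isinstance(name, bytes):
        return name.decode(encoding, "replace")
    return name


def list_dir_unicode(dirname, encoding="utf-8"):
    """List dirname as {unicode name: 'dir' or 'file'}."""
    dirname = to_text(dirname, encoding)
    result = {}
    # Passing a unicode path asks os.listdir() for unicode names.
    for name in os.listdir(dirname):
        # Python 2 can still hand back a byte string for a name it could
        # not decode; join with a path of the same type in that case.
        base = dirname.encode(encoding) if isinstance(name, bytes) else dirname
        kind = "dir" if os.path.isdir(os.path.join(base, name)) else "file"
        result[to_text(name, encoding)] = kind
    return result


def xml_roundtrip(value):
    """Marshal value as an XML-RPC response and parse it back."""
    payload = xmlrpclib.dumps((value,), methodresponse=True)
    params, _method = xmlrpclib.loads(payload)
    return params[0]
```

Calling xml_roundtrip({u"Nov\xe1 slo\u017eka": "dir"}) is a quick way to check, without starting a server, that a dictionary with unicode keys survives marshalling. Replacing undecodable bytes loses information, so if the exact original name matters, wrapping the raw bytes in xmlrpclib.Binary is the other option mentioned above.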