Python ─ UTF-8 filename from HTML form via CherryPy


Python Header:      #!/usr/bin/env python
                    # -*- coding: utf-8 -*-
                    # image_upload.py

Cherrypy Config:    cherrypy.config.update(
                        {'tools.encode.on': True,
                         'tools.encode.encoding': 'utf-8',
                         'tools.decode.on': True,
                        },)

HTML Header:        <head><meta http-equiv="Content-Type"
                    content="text/html;charset=ISO-8859-1"></head>

""" Python        2.7.3
    Cherrypy      3.2.2
    Ubuntu        12.04
"""

I'm uploading an image file to a database through an HTML form. That works without problems so far. However, if the filename is not pure ASCII, there seems to be no way to retrieve it as correct UTF-8. This is odd, because the values of the HTML text input fields survive the whole round trip, from saving to display, without problems. I therefore assume it's an encoding or decoding problem in the web application framework CherryPy, because CherryPy handles the upload.

How it works:
The HTML form POSTs the uploaded file to another Python function, which receives the file in the standard keyword-argument dictionary **kwargs. From there you get the filename including its extension like this: filename = kwargs['file'].filename. But at that point the encoding is already wrong, even though the image has not yet been processed, stored or used in any way.

I'm looking for a solution that avoids parsing the filename and changing it back "manually". I guess the result is already in UTF-8, which makes it cumbersome to get right. That's why getting CherryPy to do it might be the best way. But maybe it's even an HTML issue, because the file comes from a form.

Here are the wrongly decoded umlauts.
What I need is the input as the result.

input → result        input → result  
  ä   →   ä            Ä   →   Ä  
  ö   →   ö            Ö   →   Ö 
  ü   →   ü            Ü   →   Ãœ  
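
The pattern in this table is what you would expect if the UTF-8 bytes of each character were read back as Latin-1 (ISO-8859-1). The following is a minimal sketch of that assumption, not a description of what CherryPy does internally; the names `UMLAUTS`, `garble` and `garbled` are made up for the illustration.

```python
# -*- coding: utf-8 -*-
# Sketch only: reproduces the garbling by decoding UTF-8 bytes as Latin-1.
# Works on Python 2.7 and Python 3.3+.

UMLAUTS = [u'\xe4', u'\xf6', u'\xfc', u'\xc4', u'\xd6', u'\xdc']  # ä ö ü Ä Ö Ü


def garble(text):
    """Encode text as UTF-8, then misread those bytes as Latin-1."""
    return text.encode('utf-8').decode('latin-1')


# Map each umlaut to its two-character garbled form.
garbled = dict((ch, garble(ch)) for ch in UMLAUTS)

for ch in UMLAUTS:
    # repr() keeps the output ASCII-only, whatever the terminal encoding is.
    print(repr(ch) + ' -> ' + repr(garbled[ch]))
```

For the uppercase umlauts the second garbled character falls into the range 0x80 to 0x9F, which Latin-1 treats as control characters, so it is often invisible. Windows-1252 renders 0x9C as "œ", which would explain the "Ãœ" shown for "Ü".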

The following attempts all failed to produce the right result, which would be "Würfel".
NOTE: img_file = kwargs['file']


  • original attempt:

    result = img_file.filename.rsplit('.',1)[0]
    

    result: "Würfel"


  • change system encoding:

    reload(sys)
    sys.setdefaultencoding('utf-8')
    

    result: "Würfel"


  • encoding attempt 1:

    result = img_file.filename.rsplit('.',1)[0].encode('utf-8')
    

    result: "Würfel"


  • encoding attempt 2:

    result = unicode(img_file.filename.rsplit('.',1)[0], 'utf-8')
    

    Error Message:

    TypeError: decoding Unicode is not supported
    

  • decoding attempt:

    result = img_file.filename.rsplit('.',1)[0].decode('utf-8')
    

    Error Message:

    UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128)
    

  • cast attempt:

    result = str(img_file.filename.rsplit('.',1)[0])
    

    Error Message:

    UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128)
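
Both UnicodeEncodeError messages are consistent with the filename already being a unicode object: in Python 2, calling .decode() or str() on unicode text first encodes it implicitly with the ASCII codec. A minimal sketch that makes this hidden step explicit, assuming the value is the garbled name from the attempts above (the variable names are illustrative only):

```python
# Sketch only: makes the implicit ASCII step of Python 2 explicit.
# The sample value is the garbled name from the attempts above.
name = u'W\xc3\xbcrfel'

error = None
try:
    # Python 2 performs this step behind the scenes for both
    # name.decode('utf-8') and str(name).
    name.encode('ascii')
except UnicodeEncodeError as exc:
    error = exc

# The failing slice covers the two non-ASCII characters at index 1 and 2.
print(error)
```

The reported position 1-2 matches the two characters that the single "ü" turned into, which fits the assumption that the text was decoded with the wrong codec earlier in the request.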
    


Solution

  • Trying with your string, it seems I can recover the filename by encoding it as latin1.

    >>> s = u'W\xc3\xbcrfel.jpg'
    >>> print s.encode('latin1')
    Würfel.jpg
    >>> 
    

    You simply need to call .encode('latin1') on the filename before splitting it. That gives you the UTF-8 bytes; add .decode('utf-8') if you need a unicode object again. But the problem here is broader: you really need to figure out why your web encoding is latin1 instead of UTF-8. I don't know CherryPy, but try to make sure UTF-8 is used throughout, or you could run into other glitches when serving your application through a web server such as Apache or nginx.
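
The workaround can be wrapped in a small helper so that filenames that were not garbled pass through unchanged. This is a hedged sketch, not part of CherryPy: `repair_filename` is a hypothetical name, and it assumes the only damage is UTF-8 bytes having been decoded as Latin-1.

```python
# Sketch only: undo a "UTF-8 read as Latin-1" mix-up in an upload filename.
# repair_filename is a hypothetical helper, not a CherryPy function.
# Works on Python 2.7 and Python 3.3+.


def repair_filename(name):
    """Return the repaired name, or name unchanged if the repair does not apply."""
    try:
        # Recover the original bytes, then decode them as what they really are.
        return name.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the name contains characters outside Latin-1, or its bytes
        # are not valid UTF-8, so it was probably not garbled this way.
        return name


fixed = repair_filename(u'W\xc3\xbcrfel.jpg')

# Same split as in the question: drop the extension after the last dot.
result = fixed.rsplit('.', 1)[0]
```

In the upload handler this would be called as repair_filename(kwargs['file'].filename). One caveat: a name that really consists of Latin-1 characters whose bytes happen to form valid UTF-8 would be changed as well, so fixing the encoding at the source remains the cleaner option.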