Tags: python, encoding, python-2.7, base64, osx-mountain-lion

Encoding file names to base64 on OS X not correct when using Japanese characters


I have a bunch of files named after people's names (e.g. "john.txt", "mary.txt"), but among them are also Japanese names (e.g. "fūka.txt", "tetsurō.txt").

What I'm trying to do is to convert names before ".txt" to Base64.

The only problem is that when I take a file name (without the extension) and run it through a web-based converter, I get a different result than when I encode it with my Python script.

For example, when I copy the file name without the extension and encode "fūka" at http://www.base64encode.org, I get "ZsWra2E=". I get the same result when I take the person's name from a UTF-8 encoded PostgreSQL database, make it lower case and Base64-encode it.

But when I use the Python script below, I get "ZnXMhGth":

import glob, os
import base64

def rename(dir, pattern):
    for pathAndFilename in glob.iglob(os.path.join(dir, pattern)):

        title, ext = os.path.splitext(os.path.basename(pathAndFilename))

        t = title.lower().encode("utf-8")

        encoded_string = base64.b64encode(t) + ext

        p = os.path.join(dir, encoded_string)

        os.rename(pathAndFilename, p)

rename(u'./test', u'*.txt')

I get the same results on OS X 10.8 and on Linux (files uploaded from the Mac to a Linux server). Python is 2.7. I also tried a PHP script (the result was the same as for the Python script).

A similar difference shows up with names containing other such characters (e.g. "tetsurō").

One more strange thing: when I print the file name part with a Python script in OS X's Terminal application, copy that text, use it as a file name and THEN encode the file name to Base64, I get the same result as on the web page mentioned above. Terminal is set to UTF-8 encoding.

Could somebody please explain to me what I am doing (or thinking) wrong? Is there some little character substitution going on somewhere in between? How can I make the Python script produce the same result as the web page mentioned above? Any hints will be greatly appreciated.

SOLUTION:

With the help of Mark's answer I modified the script and it worked like a charm! Thanks Mark!

import glob, os
import base64
from unicodedata import normalize

def rename(dir, pattern):
    for pathAndFilename in glob.iglob(os.path.join(dir, pattern)):

        title, ext = os.path.splitext(os.path.basename(pathAndFilename))

        t = normalize('NFC', title.lower()).encode("utf-8") # <-- NORMALIZE !!!

        encoded_string = base64.b64encode(t) + ext

        p = os.path.join(dir, encoded_string)

        os.rename(pathAndFilename, p)

rename(u'./test', u'*.txt')
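
To check the fix without renaming anything on disk, the name computation can be pulled out into a small pure function. This is only a sketch: `encoded_filename` is a hypothetical helper name that is not part of the script above, and the two test names are written with explicit escapes so that it is unambiguous which form each one is in.

```python
# -*- coding: utf-8 -*-
import base64
import os
from unicodedata import normalize


def encoded_filename(filename):
    """Return the Base64-encoded name plus the original extension.

    Hypothetical helper: same steps as the loop body of rename() above,
    but without touching the file system.
    """
    title, ext = os.path.splitext(filename)
    # Compose the name (NFC) before encoding it to UTF-8 bytes
    t = normalize('NFC', title.lower()).encode('utf-8')
    # .decode('ascii') keeps the result a text string on Python 2 and 3
    return base64.b64encode(t).decode('ascii') + ext


decomposed = u'fu\u0304ka.txt'   # "u" followed by U+0304 COMBINING MACRON
precomposed = u'f\u016bka.txt'   # single U+016B character

print(encoded_filename(decomposed))   # ZsWra2E=.txt
print(encoded_filename(precomposed))  # ZsWra2E=.txt
```

Both spellings of the name end up with the same encoded file name, which is the one the web converter produces.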

Solution

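  • It appears that the file names the Python script receives are in a decomposed normalized form of Unicode (NFD), where the ū has been split into two characters, u and a combining macron. This is most likely because OS X's HFS+ file system stores file names in decomposed form. The web converter and the database use the composed form (NFC), which has a single character, latin small letter u with macron. As far as Unicode is concerned, the two are equivalent strings even though they don't have the same binary representation, and that is why their Base64 encodings differ.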

    You might get some more information from this Unicode FAQ: http://www.unicode.org/faq/normalization.html
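
    The difference between the two forms can be made visible with a short sketch using only the standard library. The helper name `describe` is made up for illustration. The two Base64 strings it prints are exactly the two values from the question.

```python
# -*- coding: utf-8 -*-
import base64
import unicodedata

name = u"f\u016bka"  # "fūka", written with the single precomposed character


def describe(text):
    """List the Unicode character name of each code point in text."""
    return [unicodedata.name(ch) for ch in text]


nfc = unicodedata.normalize("NFC", name)  # composed: 4 code points
nfd = unicodedata.normalize("NFD", name)  # decomposed: 5 code points

print(describe(nfc))  # second entry: LATIN SMALL LETTER U WITH MACRON
print(describe(nfd))  # LATIN SMALL LETTER U, then COMBINING MACRON

nfc_b64 = base64.b64encode(nfc.encode("utf-8")).decode("ascii")
nfd_b64 = base64.b64encode(nfd.encode("utf-8")).decode("ascii")

print(nfc_b64)  # ZsWra2E=  (what the web converter gives)
print(nfd_b64)  # ZnXMhGth  (what the original script gives)
```

    Normalizing to NFC before encoding, as in the modified script, makes both inputs produce the first value.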