python, python-2.7, unicode, io, eol

Redirect stdout to a file with Unicode encoding while keeping Windows EOL in Python 2


I hit a wall here. I need to redirect all output to a file, and that file must be encoded in UTF-8. The problem is that when using codecs.open:

# errLog = io.open(os.path.join(os.getcwdu(),u'BashBugDump.log'), 'w',
#                  encoding='utf-8')
errLog = codecs.open(os.path.join(os.getcwdu(), u'BashBugDump.log'),
                     'w', encoding='utf-8')
sys.stdout = errLog
sys.stderr = errLog

codecs.open opens the file in binary mode, so no newline translation takes place and lines end in \n instead of the Windows \r\n. I tried using io.open, but it does not work with the print statement used all over the codebase: printing a byte string to an io text stream raises a TypeError (see "Python 2.7: print doesn't speak unicode to the io module?" or "python: TypeError: can't write str to text stream").
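
The missing newline translation described above can be confirmed with a short sketch. The scratch file name is made up for the demonstration, and the snippet is written to run on both Python 2.7 and Python 3. It writes one line through codecs.open and inspects the raw bytes that reached the disk:

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'codecs_demo.log')

# codecs.open with an encoding opens the underlying file in binary mode.
log = codecs.open(path, 'w', encoding='utf-8')
log.write(u'first line\n')
log.close()

# Read the raw bytes back to see which line terminator was written.
with open(path, 'rb') as raw:
    data = raw.read()

# No \r\n is expected here, even on Windows, because nothing translated the \n.
has_crlf = b'\r\n' in data
```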

I am not the only one having this issue (for instance, see here), but the solution adopted there is specific to the logging module, which we do not use.

See also this "won't fix" bug in Python: https://bugs.python.org/issue2131

So what is the one right way to do this in Python 2?


Solution

  • Option 1

    Redirection is a shell operation. You don't have to change the Python code at all, but you do have to tell Python which encoding to use when its output is redirected. That is done with the PYTHONIOENCODING environment variable. The following batch file redirects both stdout and stderr to a UTF-8-encoded file:

    test.bat

    set PYTHONIOENCODING=utf8
    python test.py >out.txt 2>&1
    

    test.py

    #coding:utf8
    import sys
    # Unicode strings are encoded with PYTHONIOENCODING when output is redirected.
    print u"我不喜欢你女朋友!"
    print >>sys.stderr, u"你需要一个新的。"
    

    out.txt (encoded in UTF-8)

    我不喜欢你女朋友!
    你需要一个新的。
    

    Hex dump of out.txt

    0000: E6 88 91 E4 B8 8D E5 96 9C E6 AC A2 E4 BD A0 E5
    0010: A5 B3 E6 9C 8B E5 8F 8B EF BC 81 0D 0A E4 BD A0 
    0020: E9 9C 80 E8 A6 81 E4 B8 80 E4 B8 AA E6 96 B0 E7
    0030: 9A 84 E3 80 82 0D 0A
    

    Note: You do need to print Unicode strings for this to work. If you print byte strings, those bytes are written to the file as they are, with no encoding applied.
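
To illustrate why this matters, here is a minimal sketch that runs on both Python 2.7 and Python 3. GBK is used purely as an example of a non-UTF-8 encoding, and the explicit encode calls stand in for what the output stream would do. A Unicode string has no byte representation until it is encoded, whereas a byte string keeps whatever encoding it was created with:

```python
# -*- coding: utf-8 -*-
import binascii

text = u"你"  # a Unicode string: its bytes are decided only at encode time

# What a UTF-8 output stream would write for this character.
as_utf8 = text.encode('utf-8')
# The same character as a byte string that originated in a GBK context.
as_gbk = text.encode('gbk')

utf8_hex = binascii.hexlify(as_utf8).decode('ascii').upper()  # 'E4BDA0'
gbk_hex = binascii.hexlify(as_gbk).decode('ascii').upper()    # 'C4E3'
```

The UTF-8 bytes E4 BD A0 are the ones that appear in the hex dump above. A byte string holding the GBK form would reach the file as C4 E3, regardless of PYTHONIOENCODING.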

    Option 2

    codecs.open may force binary mode, but codecs.getwriter doesn't. Give it a file opened in text mode:

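    #coding:utf8
    import sys
    import codecs
    # open() in text mode ('w') lets Windows translate \n to \r\n;
    # the UTF-8 writer on top encodes Unicode strings before they reach the file.
    sys.stdout = sys.stderr = codecs.getwriter('utf8')(open('out.txt','w'))
    print u"我不喜欢你女朋友!"
    print >>sys.stderr, u"你需要一个新的。"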
    

    (Same output and hex dump as above.)
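
If no hex editor is at hand, the result of either option can be checked from Python. The helper below is a sketch only: check_log is a made-up name, not part of any library. It reads the file in binary mode and reports whether the content decodes as UTF-8 and whether every line terminator is \r\n. It runs on both Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
import io


def check_log(path):
    """Return (decodes_as_utf8, all_lines_end_with_crlf) for the file at path."""
    # Binary mode, so the bytes are seen exactly as stored on disk.
    with io.open(path, 'rb') as raw:
        data = raw.read()

    try:
        data.decode('utf-8')
        is_utf8 = True
    except UnicodeDecodeError:
        is_utf8 = False

    # After removing every CRLF pair, no bare LF or bare CR should remain.
    # An empty file, or one lacking a final CRLF, is reported as False.
    rest = data.replace(b'\r\n', b'')
    all_crlf = (data.endswith(b'\r\n')
                and b'\n' not in rest
                and b'\r' not in rest)
    return is_utf8, all_crlf
```

For an out.txt matching the hex dump above, check_log('out.txt') should return (True, True).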