Tags: python, python-2.7, unicode, popen, python-unicode

Popen subprocess does not exit when stdin includes unicode


I am executing a subprocess using Popen and feeding it input as follows (using Python 2.7.4):

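import os
from subprocess import Popen, PIPE

# Copy the current environment and force a UTF-8 locale for the child process
env = dict(os.environ)
env['LC_ALL'] = 'en_US.UTF-8'
args = ['chasen', '-i u', '-F"%m "']
process = Popen(args, stdout=PIPE, stderr=PIPE, stdin=PIPE, env=env)
out, err = process.communicate(input=string)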

Setting LC_ALL in the subprocess's environment is necessary because the input string includes Japanese characters and, when the script is not run from the command line (in my case it is called by Apache), Python cannot guess the encoding.

This setup has worked fine for me with other commands. However, now that I'm using chasen (a Japanese tokenizer), whenever I send it Unicode characters the subprocess does not return; it just sits there while the Python script chews up memory. This looks like an encoding problem, but I thought I had sorted that out by specifying the encoding with the LC_ALL environment variable.

Edit: Extra weirdness as follows. I don't get this problem when executing the Python script from the command line, with the notable exception of the '。' character, which for some reason triggers the same strange behaviour from chasen.


Solution

  • This is a bug in chasen. When it is run through Python, tracing the process (for example with strace) shows it issuing the following syscalls:

    write(1, "\n", 1)                       = 1
    read(0, "", 4096)                       = 0
    write(1, "\n", 1)                       = 1
    read(0, "", 4096)                       = 0
    

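    That is, it does not handle EOF correctly: read() returns 0 (end of input), but chasen writes another newline and tries to read again instead of exiting. Presumably this stream of newlines, which communicate() keeps buffering, is also why the Python script's memory grows. To fix this, simply append a newline ('\n') to your Python string, like this: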

    # coding: utf-8
    import os
    from subprocess import Popen, PIPE
    
    string = u"悪妻は百年の不作。"
    
    env = dict(os.environ)
    env['LC_ALL'] = 'en_US.UTF-8'
    args = ['chasen', '-i u', '-F"%m "']
    process = Popen(args, stdout=PIPE, stderr=PIPE, stdin=PIPE, env=env)
    out, err = process.communicate(input=(string + u'\n').encode('utf-8'))
    
    print(out)
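
If several call sites feed text to chasen, the newline fix above could be wrapped in a small helper so it is not forgotten. The following is only a sketch: `to_stdin_bytes` is a hypothetical name (it is not part of chasen or the subprocess module), and it assumes the input is a unicode string rather than already-encoded bytes.

```python
# -*- coding: utf-8 -*-

def to_stdin_bytes(text, encoding="utf-8"):
    """Encode `text` for a child's stdin, ending with at least one newline.

    Hypothetical helper. Assumes `text` is a unicode string
    (`unicode` on Python 2, `str` on Python 3), not encoded bytes.
    """
    # Append the terminating newline only if it is missing.
    if not text.endswith(u"\n"):
        text += u"\n"
    # Encode explicitly so the result does not depend on the locale.
    return text.encode(encoding)


payload = to_stdin_bytes(u"悪妻は百年の不作。")
already_terminated = to_stdin_bytes(u"abc\n")
```

With this in place, the call in the answer would read `process.communicate(input=to_stdin_bytes(string))`. Encoding explicitly also means the bytes sent to the child do not depend on what Python guesses about the environment, which matters when the script runs under Apache.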