subprocess call failing with unicode data from web form input - works with same input from elsewhere

I have a library containing a function that checks various input data against a number of regexps to ensure that the data is valid. The function is called both for input received by a CGI script from a web form (via lighttpd) and for input read from an sqlite database (which is in turn populated from SMS messages received by gammu-smsd).

The input is sometimes in English and sometimes in Hindi, i.e. in Devanagari script. It should always be encoded as UTF-8. I have struggled with Python's re and regex modules, which seem to be buggy when matching character classes against Devanagari characters (you can see one example here - in that case using regex instead of re fixed the problem, but I've since had trouble with regex too). Command-line grep appears far more reliable and accurate. Hence, I've resorted to using a subprocess call to pipe the requisite strings to grep, as follows:

def invalidfield(datarecord, msgtype):
    for fieldname in datarecord:
        if (msgtype, fieldname) in mainconf["MSG_FORMAT"]:
            try:
                check = subprocess.check_output("echo '" + datarecord[fieldname] + "' | grep -E '" + mainconf["MSG_FORMAT"][msgtype, fieldname] + "'", shell=True)
            except subprocess.CalledProcessError:
                return fieldname
    return None

Now, let's try this out with the following string as input: न्याज अहमद् and the following regex to check it: ^[[:alnum:] .]*[[:alnum:]][[:alnum:] .]*$

Oddly enough, exactly the same input, when read from the database, passes this regexp (as it should) and the function returns correctly. But when the same string is entered via the web form, subprocess.check_output fails with this error:

File "/usr/lib/python2.7/subprocess.py", line 537, in check_output
  process = Popen(stdout=PIPE, *popenargs, **kwargs)
File "/usr/lib/python2.7/subprocess.py", line 679, in __init__
  errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1259, in _execute_child
  raise child_exception
TypeError: execv() arg 2 must contain only strings

I cannot figure out what is going on. I've modified my lighttpd.conf using this script which ought to, at least, ensure that lighttpd.conf is using the utf-8 charset. I've also used the chardet module and run chardet.detect on the input from the webform. I get this: {'confidence': 1.0, 'encoding': 'ascii'}{'confidence': 0.99, 'encoding': 'utf-8'}

In accordance with this answer I tried replacing datarecord[fieldname] in the above with unicode(datarecord[fieldname]).encode('utf8') and also with first trying to decode datarecord[fieldname] with the 'ascii' codec. The latter fails with the usual 'ordinal not in range' error.

What is going wrong here? I just can't figure it out!

Solution

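You want to avoid echo and the shell in this case; start grep directly with an argument list and write your input to the stdin pipe of the Popen() object instead. That way the value never becomes part of a command line, so it needs no shell quoting and is not passed through execv().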

Do make sure your environment is set to the correct locale so that grep knows to parse the input as UTF-8:

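import os
import subprocess

# Inside the loop of invalidfield(): run grep in a UTF-8 locale and feed it the value on stdin.
env = dict(os.environ)
env['LC_ALL'] = 'en_US.UTF-8'
# stdout is captured so that grep does not echo the matched line into the script's own output.
p = subprocess.Popen(['grep', '-E', mainconf["MSG_FORMAT"][msgtype, fieldname]], stdin=subprocess.PIPE, stdout=subprocess.PIPE, env=env)
p.communicate(datarecord[fieldname])  # pass a UTF-8 encoded byte string, not a unicode object
if p.returncode:  # grep exits non-zero when no line matches
    return fieldname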
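
For completeness, here is a minimal sketch of how the whole check could look with this approach folded back into the original function. The names matches_pattern, invalid_field, _to_utf8 and the formats parameter (standing in for mainconf["MSG_FORMAT"]) are hypothetical, introduced only for illustration. It assumes grep is on the PATH and that the en_US.UTF-8 locale is installed; if the locale is missing, grep will likely fall back to the C locale and non-ASCII character classes may not match. Text values are encoded to UTF-8 explicitly, on the assumption that the web form hands the script a unicode object rather than a byte string.

```python
import os
import subprocess


def _to_utf8(value):
    # Pipes and exec arguments carry bytes; encode text explicitly
    # instead of relying on the interpreter's default codec.
    if isinstance(value, bytes):
        return value
    return value.encode("utf-8")


def matches_pattern(value, pattern, locale_name="en_US.UTF-8"):
    """Return True if grep -E finds a line of `value` matching `pattern`."""
    env = dict(os.environ)
    env["LC_ALL"] = locale_name  # assumption: this locale exists on the host
    p = subprocess.Popen(
        ["grep", "-E", "-e", _to_utf8(pattern)],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,  # capture the matched line so it is not echoed
        env=env,
    )
    p.communicate(_to_utf8(value))
    # grep exits with 0 on a match, 1 on no match, and 2 on an error.
    return p.returncode == 0


def invalid_field(datarecord, msgtype, formats):
    """Return the name of the first field that fails its regex, else None."""
    for fieldname in datarecord:
        if (msgtype, fieldname) in formats:
            if not matches_pattern(datarecord[fieldname],
                                   formats[msgtype, fieldname]):
                return fieldname
    return None
```

Calling invalid_field(datarecord, msgtype, mainconf["MSG_FORMAT"]) would then take the place of the original invalidfield(datarecord, msgtype). Note that a grep error (exit status 2, for example on a malformed pattern) is treated here the same as a failed match, which may or may not be what you want.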