Tags: python, unicode, portability, doctest

Doctest fails due to unicode leading u


I am writing a doctest for a function that outputs a list of tokenized words.

r'''

>>> s0 = "This is a tokenized sentence s\u00f3"
>>> tokenizer.tokenize(s0)
['This', 'is', 'a', 'tokenized', 'sentence', 'só']

'''

Using Python 3.4, my test passes with no problems.

Using Python 2.7, I get:

Expected:
  ['This', 'is', 'a', 'tokenized', 'sentence', 'só']
Got:
  [u'This', u'is', u'a', u'tokenized', u'sentence', u's\xf3']

My code has to work on both Python 3.4 and Python 2.7. How can I solve this problem?


Solution

  • Python 3 uses different string literals for Unicode objects: there is no u prefix (in the canonical representation), and printable non-ASCII characters are shown literally, e.g. 'só' is how Python 3 represents a Unicode string (if you see 'só' in Python 2 output, it is a bytestring).
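    The difference is easy to see by comparing reprs (a minimal sketch; the Python 2 result is shown in a comment, since it has to be run under Python 2):

    ```python
    # Under Python 3, str is Unicode: the repr has no u prefix and shows
    # printable non-ASCII characters such as ó literally.
    print(repr('só'))   # -> 'só'

    # Under Python 2, the equivalent Unicode literal reprs differently:
    #   >>> repr(u's\xf3')
    #   "u's\\xf3'"
    ```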

    If all you are interested in is how the function splits an input text into tokens, you could print each token on a separate line, which makes the output identical on Python 2 and 3:

    >>> print("\n".join(tokenizer.tokenize(s0)))
    This
    is
    a
    tokenized
    sentence
    só
    
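    A complete, runnable sketch of this approach (str.split stands in for the real tokenizer, which isn't shown in the question):

    ```python
    # -*- coding: utf-8 -*-
    r"""
    >>> s = u"This is a tokenized sentence s\u00f3"
    >>> print(u"\n".join(s.split()))
    This
    is
    a
    tokenized
    sentence
    só
    """
    import doctest

    # failed == 0 means the one-token-per-line output matched: since no
    # list repr is compared, no u'' prefixes can appear in the output.
    result = doctest.testmod()
    print(result)
    ```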

    As an alternative, you could customize doctest.OutputChecker, for example:

    #!/usr/bin/env python
    r"""
    >>> u"This is a tokenized sentence s\u00f3".split()
    [u'This', u'is', u'a', u'tokenized', u'sentence', u's\xf3']
    """
    import doctest
    import re
    import sys
    
    class Py23DocChecker(doctest.OutputChecker):
        def check_output(self, want, got, optionflags):
            if sys.version_info[0] > 2:
                # On Python 3, strip the u'' / u"" prefixes from the
                # expected output so doctests written against Python 2
                # reprs still match.
                want = re.sub("u'(.*?)'", "'\\1'", want)
                want = re.sub('u"(.*?)"', '"\\1"', want)
            return doctest.OutputChecker.check_output(self, want, got, optionflags)
    
    if __name__ == "__main__":
        import unittest
    
        suite = doctest.DocTestSuite(sys.modules['__main__'], checker=Py23DocChecker())
        sys.exit(len(unittest.TextTestRunner().run(suite).failures))
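    The rewriting the checker performs can be exercised on its own (a standalone sketch of the same two re.sub calls):

    ```python
    import re

    def strip_u_prefix(want):
        """Rewrite Python 2 style u'...' / u"..." literals in an expected
        doctest output so it matches Python 3's repr (same regexes as the
        checker above)."""
        want = re.sub("u'(.*?)'", "'\\1'", want)
        want = re.sub('u"(.*?)"', '"\\1"', want)
        return want

    print(strip_u_prefix("[u'This', u'is', u's\\xf3']"))
    # -> ['This', 'is', 's\xf3']
    ```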