Search code examples
pythonpython-3.xindentationliteralspython-unicode

Using textwrap.dedent() with bytes in Python 3


When I use a triple-quoted multiline string in Python, I tend to use textwrap.dedent to keep the code readable, with good indentation:

some_string = textwrap.dedent("""
    First line
    Second line
    ...
    """).strip()

However, in Python 3.x, textwrap.dedent doesn't seem to work with byte strings. I encountered this while writing a unit test for a method that returned a long multiline byte string, for example:

# The function to be tested

def some_function():
    return b'Lorem ipsum dolor sit amet\n  consectetuer adipiscing elit'

# Unit test

import unittest
import textwrap

class SomeTest(unittest.TestCase):
    def test_some_function(self):
        self.assertEqual(some_function(), textwrap.dedent(b"""
            Lorem ipsum dolor sit amet
              consectetuer adipiscing elit
            """).strip())

if __name__ == '__main__':
    unittest.main()

In Python 2.7.10 the above code works fine, but in Python 3.4.3 it fails:

E
======================================================================
ERROR: test_some_function (__main__.SomeTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 16, in test_some_function
    """).strip())
  File "/usr/lib64/python3.4/textwrap.py", line 416, in dedent
    text = _whitespace_only_re.sub('', text)
TypeError: can't use a string pattern on a bytes-like object

----------------------------------------------------------------------
Ran 1 test in 0.001s

FAILED (errors=1)

So: Is there an alternative to textwrap.dedent that works with byte strings?

  • I could write such a function myself, but if there is an existing function, I'd prefer to use it.
  • I could convert to unicode, use textwrap.dedent, and convert back to bytes. But this is only viable if the byte string conforms to some Unicode encoding.

Solution

  • It seems like dedent does not support bytestrings, sadly. However, if you want cross-compatible code, I recommend you take advantage of the six library:

    import sys, unittest
    from textwrap import dedent
    
    import six
    
    
    def some_function():
        return b'Lorem ipsum dolor sit amet\n  consectetuer adipiscing elit'
    
    
    class SomeTest(unittest.TestCase):
        def test_some_function(self):
            actual = some_function()
    
            expected = six.b(dedent("""
                Lorem ipsum dolor sit amet
                  consectetuer adipiscing elit
                """)).strip()
    
            self.assertEqual(actual, expected)
    
    if __name__ == '__main__':
        unittest.main()
    

    This is similar to your bullet point suggestion in the question

    I could convert to unicode, use textwrap.dedent, and convert back to bytes. But this is only viable if the byte string conforms to some Unicode encoding.

    But you're misunderstanding something about encodings here - if you can write the string literal in your test like that in the first place, and have the file successfully parsed by python (i.e. the correct coding declaration is on the module), then there is no "convert to unicode" step here. The file gets parsed in the encoding specified (or sys.defaultencoding, if you didn't specify) and then when the string is a python variable it is already decoded.