
Processing non-UTF-8 Posix filenames using Python pathlib?


I'm trying to use the pathlib module, part of the standard library since Python 3.4, to find and manipulate file paths. Treating paths in an object-oriented way is an improvement over the os.path-style functions, but I'm having trouble with some more exotic filenames on POSIX filesystems, specifically files whose names contain bytes that cannot be decoded as UTF-8:

>>> pathlib.PosixPath(b'\xe9')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.5/pathlib.py", line 969, in __new__
    self = cls._from_parts(args, init=False)
  File "/usr/lib/python3.5/pathlib.py", line 651, in _from_parts
    drv, root, parts = self._parse_args(args)
  File "/usr/lib/python3.5/pathlib.py", line 643, in _parse_args
    % type(a))
TypeError: argument should be a path or str object, not <class 'bytes'>

>>> b'\xe9'.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 0: unexpected end of data

The problem is that such files can exist on a POSIX filesystem, and I'd like my application to handle any filesystem-valid filename rather than raise errors or behave unpredictably.

I can get a PosixPath object for such a file by calling .iterdir() on its parent directory. But I have yet to find a way to build one from a full path held in a variable of type 'bytes', which is hard to avoid when loading paths from a source that supports all filesystem-valid raw byte values (such as a database or a file containing NUL-separated paths).

Is there a way to do this that I'm not aware of? Or, if it's really not possible: is this by design, or could it be considered a deficiency in the standard library that might warrant a bug report?

I did find a related bug report, but that issue concerned documentation incorrectly mentioning that arguments of class 'bytes' were allowed.


Solution

  • I think you can get what you want like this:

    import os
    from pathlib import PosixPath
    PosixPath(os.fsdecode(b'\xe9'))
    

    Demo:

    >>> import os, pathlib
    >>> b = b'\xe9'
    >>> p = pathlib.Path(os.fsdecode(b))
    >>> p.exists()
    False
    >>> with open(b, mode='w') as f:
    ...     f.write('wacky filename')
    ...     
    >>> p.exists()
    True
    >>> p.read_bytes()
    b'wacky filename'
    >>> os.listdir(b'.')
    [b'\xe9']
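
To connect this back to the question's use case (paths loaded as bytes from a NUL-separated source), here is a minimal sketch. It assumes a POSIX system, where os.fsdecode() applies the 'surrogateescape' error handler so that undecodable bytes survive a round trip through str. The helper names paths_from_nul_separated and path_to_bytes are hypothetical, made up for illustration, and PurePosixPath is used only so the example does not touch the filesystem.

```python
import os
from pathlib import PurePosixPath


def paths_from_nul_separated(data):
    """Hypothetical helper: split NUL-separated bytes into path objects."""
    # os.fsdecode() maps undecodable bytes to lone surrogates on POSIX,
    # so no filename information is lost in the str representation.
    return [PurePosixPath(os.fsdecode(raw)) for raw in data.split(b"\0") if raw]


def path_to_bytes(path):
    """Hypothetical helper: recover the raw filesystem bytes of a path."""
    # os.fsencode() is the inverse of os.fsdecode() and accepts path objects.
    return os.fsencode(path)


raw_input_data = b"dir/\xe9\0plain.txt\0"
paths = paths_from_nul_separated(raw_input_data)
round_tripped = [path_to_bytes(p) for p in paths]
```

Calling bytes() on a path object should give the same result as os.fsencode(), so either can be used when the raw bytes need to be written back to the original source.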