python glob fnmatch

Separate plain filenames from fnmatch patterns in Python


My Python function is given a (long) list of path arguments, each of which may be a glob pattern. I make a pass over this list with glob.glob to collect all the matching filenames, like this:

files = [filename for pattern in patterns for filename in glob.glob(pattern)]

That works, but the filesystem I'm on has very poor performance for directory listing operations, and currently this operation adds about a minute(!) to the start-up time of my program. To speed this up, I would like to perform glob expansion only for non-trivial glob patterns (i.e. those that aren't just normal pathnames), along these lines:

def cheapglob(pattern):
    return [pattern] if istrivial(pattern) else glob.glob(pattern)
files = [filename for pattern in patterns for filename in cheapglob(pattern)]

Since glob.glob basically does a set of directory listings coupled with fnmatch.fnmatch, I thought it should be possible to somehow ask fnmatch whether a given string is a non-trivial pattern or not, but I can't see how to do that.

As a fallback, I guess I could try to identify these patterns in the string myself, though that feels a lot like reinventing the wheel and would be error-prone. This seems like the sort of thing there should be an elegant solution for.


Solution

  • According to the fnmatch source code, the only characters it treats specially are *, ? and [ (with ] closing a bracket expression). Hence any pattern that contains none of these will only match itself. We can therefore implement the cheapglob mentioned in the question as

    import glob, re
    def cheapglob(s): return glob.glob(s) if re.search("[][*?]", s) else [s]
    

    This will only hit the file system for patterns that include special characters. It differs subtly from a plain glob.glob: for a pattern with no special characters, such as "foo.txt", this function returns ["foo.txt"] regardless of whether that file exists, while glob.glob returns [] if the file isn't there. The calling code therefore needs to handle the possibility that some of the returned files do not exist.
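
For readers who want the existence behaviour of glob.glob back, here is a minimal sketch of a slightly longer variant. The must_exist parameter and the _MAGIC name are illustrative assumptions, not part of the answer above or of the standard library. The check costs a single stat call per plain path rather than a directory listing, though whether that is cheap enough on a slow filesystem is something to measure.

```python
import glob
import os
import re

# Characters that mark a string as a wildcard pattern. "]" is included to
# mirror the one-liner above; it only makes the check more conservative,
# since more strings end up being handed to glob.glob.
_MAGIC = re.compile(r"[*?\[\]]")


def cheapglob(pattern, must_exist=False):
    """Expand pattern with glob.glob only if it contains wildcard characters.

    must_exist is a hypothetical flag added for illustration: when True,
    plain paths that do not exist are dropped, as glob.glob would do.
    """
    if _MAGIC.search(pattern):
        return glob.glob(pattern)
    if must_exist and not os.path.lexists(pattern):
        return []
    return [pattern]


# Usage, following the list comprehension from the question.
patterns = ["foo.txt", "data/*.csv"]
files = [name for pattern in patterns for name in cheapglob(pattern)]
```

With the default must_exist=False, "foo.txt" is returned without touching the file system, so any later open() on it should be prepared for a missing file.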