python python-re

re.findall to get a list of directories from a /-separated pathname, but allowing // as a literal, single /


The title pretty much says it.

I tried a variety of things, including but not limited to:

>>> re.findall(r'(/+)([^/]*)', '///a//b/c///d')
[('///', 'a'), ('//', 'b'), ('/', 'c'), ('///', 'd')]

And:

>>> re.findall('(/+[^/]*)', '///a//b/c///d')
['///a', '//b', '/c', '///d']

What I want is something like:

>>> re.findall(something, '///a//b/c///d')
['/', 'a/b', 'c/', 'd']

...or close to that. Note that this example is a relative path: the // at the beginning is an escaped single slash that makes up the entire first folder name, and the third / is the separator that follows it.

We have something working using string.split('/') and list operations, but we want to explore regex-based solutions.

Thanks!


Solution

  • Assuming that escaping takes precedence over splitting (i.e. '///' = escaped '/' + separator), you could do it like this:

    import re # this is not the ideal tool for this kind of thing

    p = '///a//b/c///d'

    # the pattern matches a '/' that is preceded by '//' (an escaped '/')
    # or that is not preceded by another '/';
    # in both cases the '/' must not be followed by another '/'
    # (caveat: a run of four slashes, i.e. two escaped '/', is still split at its last '/')

    pattern = r"((?<=\/\/)|(?<!\/))(?!.\/)\/"

    # replace the separators with a newline, unescape the '//',
    # then split on the newline

    path = re.sub(pattern,"\n",p).replace("//","/").split("\n")

    # or split and unescape; the filter drops the empty strings that
    # re.split emits for the capturing group

    path = [s.replace("//","/") for s in re.split(pattern,p) if s]

    print(path) # ['/', 'a/b', 'c/', 'd']
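
Since the question asked for re.findall specifically, here is a minimal sketch that matches the components themselves rather than the separators. The names TOKEN and split_path are hypothetical, not from the answer above. It pairs slashes from left to right, so it keeps the same "escaping takes precedence" assumption, and it silently drops empty components (for example the one before a leading single '/').

```python
import re

# Hypothetical helper, not part of the original answer.
# A component is a run of non-slash characters and/or escaped slashes ('//').
TOKEN = re.compile(r"(?:[^/]|//)+")

def split_path(p):
    """Return the components of p, with each '//' unescaped to '/'."""
    return [tok.replace("//", "/") for tok in TOKEN.findall(p)]

print(split_path("///a//b/c///d"))  # ['/', 'a/b', 'c/', 'd']
print(split_path("a////b"))         # ['a//b']
```

Because a lone '/' can never start or extend a match, every unpaired slash ends up between matches and acts as a separator.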
    

    However, a non-re solution will probably be more manageable:

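    # swap each escaped '//' for a placeholder, split on the remaining '/',
    # then turn the placeholder back into '/'
    # (assumes p contains no "\0", and no "\n" for the second variant)
    path = [s.replace("\0","/") for s in p.replace("//","\0").split("/")]

    # or

    path = p.replace("//","\0").replace("/","\n").replace("\0","/").split("\n")

    print(path) # ['/', 'a/b', 'c/', 'd']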
    

    Note: to obtain ["c//","d"], the source would need to be encoded as "c/////d" (two escaped slashes followed by one separator)
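
To make the encoding in that note concrete, here is a rough sketch of the inverse operation. join_path and split_path are hypothetical names; split_path is just the placeholder-based one-liner from above wrapped in a function, so it still assumes the path contains no "\0".

```python
def join_path(parts):
    # Hypothetical encoder: double every literal '/', then join with a single '/'.
    return "/".join(s.replace("/", "//") for s in parts)

def split_path(p):
    # Placeholder-based split from the answer; assumes p contains no "\0".
    return [s.replace("\0", "/") for s in p.replace("//", "\0").split("/")]

print(join_path(["c//", "d"]))               # c/////d
print(join_path(["/", "a/b", "c/", "d"]))    # ///a//b/c///d
print(split_path(join_path(["c//", "d"])))   # ['c//', 'd']

# The encoding is ambiguous when a later component starts with '/':
print(split_path(join_path(["a", "/b"])))    # ['a/', 'b']
```

The last line shows why the "escaping takes precedence" assumption matters: 'a///b' could mean either ['a/', 'b'] or ['a', '/b'], and the decoder always picks the first reading.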