Search code examples
regexurlencode

How to extract only filenames from an URL with RegEx even with ANSI escaped characters?


I need to extract ONLY the file names from any URL. I looked at all previous answers on stackoverflow regarding URLs and filenames, but no one considered the case of a file name with escaped characters.

I have for example an URL like this:

https://content.com/pbpython.py/notebooks/thirsty-allies.mov?file=The%20Big%20Kahuna.webm.tar.gz&f=Crosstab%20Explained.ipynb&a=b&m=plok%202001.tar.gz

I tried many RegEx, and finally I found one that did not split the file names when it encounter the escaped character:

"(?:\w*:\/\/)?((?:[\w-_]*\.?)+:?\d*(?:\/?[\w-_.]+\/?)*)[\?]?[\&]?([\w-_]*)?=?([\/\w-_\?\)%\.\(\*\[\]\^<>\\]*)?[\&]?([\w-_]*)?=?([\/\w-_\?\)%\.\(\*\[\]\^<>\\]*)?[\&]?([\w-_]*)?=?([\/\w-_\?\)%\.\(\*\[\]\^<>\\]*)?[\&]?([\w-_]*)?=?([\/\w-_\?\)%\.\(\*\[\]\^<>\\]*)?"g

You can test it here: https://regex101.com/r/LRWlif/7

The results are a mess:

match,group,is_participating,start,end,content
1,0,yes,0,148,https://content.com/pbpython.py/notebooks/thirsty-allies.mov?file=The%20Big%20Kahuna.webm.tar.gz&f=Crosstab%20Explained.ipynb&a=b&m=plok%202001.tar.gz
1,1,yes,8,60,content.com/pbpython.py/notebooks/thirsty-allies.mov
1,2,yes,61,65,file
1,3,yes,66,94,The%20Big%20Kahuna.webm.tar.gz
1,4,yes,95,96,f
1,5,yes,97,123,Crosstab%20Explained.ipynb
1,6,yes,124,125,a
1,7,yes,126,127,b
1,8,yes,128,129,m
1,9,yes,130,148,plok%202001.tar.gz
2,0,yes,148,148,
2,1,yes,148,148,
2,2,yes,148,148,
2,3,yes,148,148,
2,4,yes,148,148,
2,5,yes,148,148,
2,6,yes,148,148,
2,7,yes,148,148,
2,8,yes,148,148,
2,9,yes,148,148,

The only good thing is that the filenames are all matched somehow, with no split parts, with the exception of "thirsty-allies.mov" that is matched along some url parts.

Also there is the issue that not all escape characters can be part of a filename. %2F for example is the "/" that separate folders in paths, and should not considered part of the match.

For example:

https://www.contoso.com/sites/marketing/documents/Shared%20Documents/Forms/AllItemA.aspx?RootFolder=%2Fsites%2Fmarketing%2Fdocuments%2FShared%20Documents%2FPFProduct%20Promotion%202001.docx&FolderCTID=0x012000F2A09653197F4F4F919923797C42ADEC&View=%7BCD527605-9A7A-448D-9A35-67A33EF9F766%7D

With the same RegEx we get this result:

match,group,is_participating,start,end,content
1,0,yes,0,288,https://www.contoso.com/sites/marketing/documents/Shared%20Documents/Forms/AllItemA.aspx?RootFolder=%2Fsites%2Fmarketing%2Fdocuments%2FShared%20Documents%2FPFProduct%20Promotion%202001.docx&FolderCTID=0x012000F2A09653197F4F4F919923797C42ADEC&View=%7BCD527605-9A7A-448D-9A35-67A33EF9F766%7D
1,1,yes,8,56,www.contoso.com/sites/marketing/documents/Shared
1,2,yes,56,56,
1,3,yes,56,99,%20Documents/Forms/AllItemA.aspx?RootFolder
1,4,yes,99,99,
1,5,yes,100,188,%2Fsites%2Fmarketing%2Fdocuments%2FShared%20Documents%2FPFProduct%20Promotion%202001.docx
1,6,yes,189,199,FolderCTID
1,7,yes,200,240,0x012000F2A09653197F4F4F919923797C42ADEC
1,8,yes,241,245,View
1,9,yes,246,288,%7BCD527605-9A7A-448D-9A35-67A33EF9F766%7D
2,0,yes,288,288,
2,1,yes,288,288,
2,2,yes,288,288,
2,3,yes,288,288,
2,4,yes,288,288,
2,5,yes,288,288,
2,6,yes,288,288,
2,7,yes,288,288,
2,8,yes,288,288,
2,9,yes,288,288,

As you can see, the filename to match is:

PFProduct%20Promotion%202001.docx

but the RegEx matched:

%2Fsites%2Fmarketing%2Fdocuments%2FShared%20Documents%2FPFProduct%20Promotion%202001.docx

How can I get just the filenames and nothing else?


Solution

  • There is no language tagged, but if you know that you always have urls you might use

    (?<=[=\/]|%2F)(?:(?!%2F)[^?&\s\/])+\.\w+(?=[?&]|$)
    

    Explanation

    • (?<= Positive lookbehind, assert what is to the left of the current position is
      • [=\/] Match either = or /
      • | Or
      • %2F Match literally
    • ) Close the lookbehind
    • (?: Non capture group
      • (?!%2F)[^?&\s\/] Match 1 char other than what is listed in the character class if %2F is not directly to the right of the current position
    • )+ Close the non capture group and repeat 1+ times
    • \.\w+ Match a dot and 1 or more word characters
    • (?=[?&]|$) Positive lookahead, assert either ? or & or the end of the string directly to the right of the current position

    Regex demo


    Other variations

    Or with a capture group if the lookbehind does not work with not fixed width:

    (?:[=\/]|%2F)((?:(?!%2F)[^?&\s\/])+\.\w+)(?=[?&]|$)
    

    Regex demo

    In languages where an infinite quantifier in the lookbehind is supported:

    (?<=https?:\/\/\S*(?:[=\/]|%2F))(?:(?!%2F)[^?&\s\/])+\.\w+(?=[?&]|$)
    

    Regex demo