I need to extract ONLY the file names from any URL. I looked at all the previous answers on Stack Overflow regarding URLs and file names, but none of them considered the case of a file name with percent-escaped characters.
For example, I have a URL like this:
https://content.com/pbpython.py/notebooks/thirsty-allies.mov?file=The%20Big%20Kahuna.webm.tar.gz&f=Crosstab%20Explained.ipynb&a=b&m=plok%202001.tar.gz
I tried many regexes, and finally found one that does not split the file names when it encounters an escaped character:
"(?:\w*:\/\/)?((?:[\w-_]*\.?)+:?\d*(?:\/?[\w-_.]+\/?)*)[\?]?[\&]?([\w-_]*)?=?([\/\w-_\?\)%\.\(\*\[\]\^<>\\]*)?[\&]?([\w-_]*)?=?([\/\w-_\?\)%\.\(\*\[\]\^<>\\]*)?[\&]?([\w-_]*)?=?([\/\w-_\?\)%\.\(\*\[\]\^<>\\]*)?[\&]?([\w-_]*)?=?([\/\w-_\?\)%\.\(\*\[\]\^<>\\]*)?"g
You can test it here: https://regex101.com/r/LRWlif/7
The results are a mess:
match,group,is_participating,start,end,content
1,0,yes,0,148,https://content.com/pbpython.py/notebooks/thirsty-allies.mov?file=The%20Big%20Kahuna.webm.tar.gz&f=Crosstab%20Explained.ipynb&a=b&m=plok%202001.tar.gz
1,1,yes,8,60,content.com/pbpython.py/notebooks/thirsty-allies.mov
1,2,yes,61,65,file
1,3,yes,66,94,The%20Big%20Kahuna.webm.tar.gz
1,4,yes,95,96,f
1,5,yes,97,123,Crosstab%20Explained.ipynb
1,6,yes,124,125,a
1,7,yes,126,127,b
1,8,yes,128,129,m
1,9,yes,130,148,plok%202001.tar.gz
2,0,yes,148,148,
2,1,yes,148,148,
2,2,yes,148,148,
2,3,yes,148,148,
2,4,yes,148,148,
2,5,yes,148,148,
2,6,yes,148,148,
2,7,yes,148,148,
2,8,yes,148,148,
2,9,yes,148,148,
The only good thing is that the file names are all matched somehow, with no split parts, with the exception of "thirsty-allies.mov", which is matched along with some parts of the URL.
There is also the issue that not every escaped character can be part of a file name. %2F, for example, is the "/" that separates folders in paths, and should not be considered part of the match.
For example:
https://www.contoso.com/sites/marketing/documents/Shared%20Documents/Forms/AllItemA.aspx?RootFolder=%2Fsites%2Fmarketing%2Fdocuments%2FShared%20Documents%2FPFProduct%20Promotion%202001.docx&FolderCTID=0x012000F2A09653197F4F4F919923797C42ADEC&View=%7BCD527605-9A7A-448D-9A35-67A33EF9F766%7D
With the same RegEx we get this result:
match,group,is_participating,start,end,content
1,0,yes,0,288,https://www.contoso.com/sites/marketing/documents/Shared%20Documents/Forms/AllItemA.aspx?RootFolder=%2Fsites%2Fmarketing%2Fdocuments%2FShared%20Documents%2FPFProduct%20Promotion%202001.docx&FolderCTID=0x012000F2A09653197F4F4F919923797C42ADEC&View=%7BCD527605-9A7A-448D-9A35-67A33EF9F766%7D
1,1,yes,8,56,www.contoso.com/sites/marketing/documents/Shared
1,2,yes,56,56,
1,3,yes,56,99,%20Documents/Forms/AllItemA.aspx?RootFolder
1,4,yes,99,99,
1,5,yes,100,188,%2Fsites%2Fmarketing%2Fdocuments%2FShared%20Documents%2FPFProduct%20Promotion%202001.docx
1,6,yes,189,199,FolderCTID
1,7,yes,200,240,0x012000F2A09653197F4F4F919923797C42ADEC
1,8,yes,241,245,View
1,9,yes,246,288,%7BCD527605-9A7A-448D-9A35-67A33EF9F766%7D
2,0,yes,288,288,
2,1,yes,288,288,
2,2,yes,288,288,
2,3,yes,288,288,
2,4,yes,288,288,
2,5,yes,288,288,
2,6,yes,288,288,
2,7,yes,288,288,
2,8,yes,288,288,
2,9,yes,288,288,
As you can see, the filename to match is:
PFProduct%20Promotion%202001.docx
but the RegEx matched:
%2Fsites%2Fmarketing%2Fdocuments%2FShared%20Documents%2FPFProduct%20Promotion%202001.docx
How can I get just the filenames and nothing else?
There is no language tagged, but if you know that you always have URLs, you might use:
(?<=[=\/]|%2F)(?:(?!%2F)[^?&\s\/])+\.\w+(?=[?&]|$)
Explanation
(?<=  Positive lookbehind, assert that what is to the left of the current position is
[=\/]  Match either = or /
|  Or
%2F  Match literally
)  Close the lookbehind
(?:  Non capture group
(?!%2F)[^?&\s\/]  Match 1 char other than what is listed in the character class if %2F is not directly to the right of the current position
)+  Close the non capture group and repeat 1+ times
\.\w+  Match a dot and 1 or more word characters
(?=[?&]|$)  Positive lookahead, assert either ? or & or the end of the string directly to the right of the current position

Other variations
Or use a capture group if your regex engine does not support a lookbehind that is not fixed width:
(?:[=\/]|%2F)((?:(?!%2F)[^?&\s\/])+\.\w+)(?=[?&]|$)
In languages where an infinite quantifier in the lookbehind is supported:
(?<=https?:\/\/\S*(?:[=\/]|%2F))(?:(?!%2F)[^?&\s\/])+\.\w+(?=[?&]|$)
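As a rough illustration, here is how the capture-group variation could be used from Python. The choice of Python is an assumption, since the question does not tag a language. Python's built-in re module only accepts fixed-width lookbehinds, which is why this sketch uses the capture-group pattern rather than the first one. The names FILENAME_RE and extract_filenames are made up for this example, and decoding the result with urllib.parse.unquote is an optional extra step that the patterns above do not perform.

```python
import re
from urllib.parse import unquote

# Capture-group variation from above: group 1 holds the file name.
FILENAME_RE = re.compile(r"(?:[=\/]|%2F)((?:(?!%2F)[^?&\s\/])+\.\w+)(?=[?&]|$)")


def extract_filenames(url):
    """Return the file names found in url, percent-decoded (hypothetical helper)."""
    # findall returns only group 1 because the pattern has a single capture group.
    return [unquote(name) for name in FILENAME_RE.findall(url)]


url1 = (
    "https://content.com/pbpython.py/notebooks/thirsty-allies.mov"
    "?file=The%20Big%20Kahuna.webm.tar.gz&f=Crosstab%20Explained.ipynb"
    "&a=b&m=plok%202001.tar.gz"
)
print(extract_filenames(url1))
# ['thirsty-allies.mov', 'The Big Kahuna.webm.tar.gz',
#  'Crosstab Explained.ipynb', 'plok 2001.tar.gz']
```

Note that with the second URL from the question this also returns AllItemA.aspx from the path, in addition to the .docx name, because it sits directly before the "?". The pattern as written matches only an uppercase %2F; if lowercase %2f can occur in your URLs, compiling with re.IGNORECASE is one possible adjustment.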