I'm using a snippet I found on Stack Exchange that finds all URLs in a string using re.findall(). It works perfectly, but to further my knowledge I would like to know exactly how it works. The code is as follows:
re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', site)
As far as I understand, it's finding all strings starting with http or https (is that why the [s] is in square brackets?), but I'm not really sure about all the stuff after that, the (?:[etc etc etc]))+ part. I think the stuff in the square brackets, e.g. [a-zA-Z], means all letters from a to z, upper or lower case, but what about the rest? And how does it manage to get only the URL and not some random string at the end of the URL?
Thanks in advance :)
Using this link you can get your regex explained: Your regex explained
To add a bit more:
[s]? means "an optional 's' character", but that's because of the ?, not because of the brackets. I think the brackets are superfluous: s? would do the same.
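A quick way to convince yourself that the brackets around the s change nothing is to run both spellings on the same input. This is only a sketch; the sample text is made up for illustration and is not from the question.

```python
import re

# Hypothetical sample text, not taken from the original post.
text = "see http://a.example and https://b.example"

# A one-character class [s] matches exactly the same thing as a bare s,
# so the ? makes the 's' optional in both patterns.
with_brackets = re.findall(r"http[s]?://", text)
without_brackets = re.findall(r"https?://", text)

print(with_brackets)     # ['http://', 'https://']
print(without_brackets)  # ['http://', 'https://']
```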
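A space isn't one of the accepted characters, so the match stops there. The same goes for characters such as '"', '#' and '~'. Note that '/' is accepted even though it is not literally mentioned: it falls inside the character range $-_, which runs from '$' (code 36) to '_' (code 95) and therefore also covers ':', '=', '?', the digits and the uppercase letters (see http://www.asciitable.com/index/asciifull.gif).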
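To see which characters the range $-_ really covers, you can test single characters against that class and look at their code points. The sketch below does that and then runs the full pattern from the question on a made-up string; `sample` and the example URL are assumptions for illustration only.

```python
import re

# The third character class from the pattern, on its own.
char_class = re.compile(r"[$-_@.&+]")

# The range $-_ spans code points 36 ('$') through 95 ('_').
range_bounds = (ord("$"), ord("_"))
print(range_bounds)  # (36, 95)

# Map each test character to whether the class accepts it.
accepted = {ch: bool(char_class.fullmatch(ch)) for ch in [" ", '"', "#", "/", ":", "?"]}
print(accepted)  # space, '"' and '#' are rejected; '/', ':' and '?' are accepted

# The full pattern from the question, written as a raw string.
pattern = r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

# Hypothetical input: the match runs through '/', '?' and '=' and stops at the space.
sample = "Visit https://example.com/path?q=1 for details"
urls = re.findall(pattern, sample)
print(urls)  # ['https://example.com/path?q=1']
```

The space (code 32) lies just below the start of the range, which is why the URL ends there and the trailing words are not picked up.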
(?:%[0-9a-fA-F][0-9a-fA-F]) matches percent-encoded characters in URLs, i.e. a % followed by two hexadecimal digits, e.g. %2f for the '/' character.
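As a small illustration of the percent-encoding part, the sub-pattern can be run on its own and the hits decoded with the standard library's urllib.parse.unquote. The URL here is a made-up example, not one from the question.

```python
import re
from urllib.parse import unquote

# Just the percent-escape part of the pattern: '%' plus two hex digits.
escape = re.compile(r"%[0-9a-fA-F][0-9a-fA-F]")

# Hypothetical URL containing two percent-encoded characters.
url = "http://example.com/a%2Fb%20c"

found = escape.findall(url)
decoded = [unquote(e) for e in found]

print(found)    # ['%2F', '%20']
print(decoded)  # ['/', ' ']
```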
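A non-capturing group, written (?:...), groups part of the pattern so that the + can apply to it, but what it matched is not stored as a separate group in the result, i.e. you cannot extract that bit of the string on its own after the regex has been run. With re.findall() this matters: if the pattern contained a capturing group, findall would return the contents of that group instead of the whole URL.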
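The effect on re.findall() is easiest to see side by side. The sketch below uses a deliberately simplified pattern (an assumption for readability, not the one from the question) once with a non-capturing group and once with a capturing one.

```python
import re

# Hypothetical sample text.
text = "go to http://example.com/x now"

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r"https?://(?:[a-z]|[./])+", text)

# Capturing group: findall returns what the group captured instead,
# and a repeated group only keeps its last repetition.
capturing = re.findall(r"https?://([a-z]|[./])+", text)

print(non_capturing)  # ['http://example.com/x']
print(capturing)      # ['x']
```

That is presumably why the original snippet uses (?:...) throughout: it lets findall hand back complete URLs.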