
Regex in WWW::Mechanize in Perl


I am not sure what the correct syntax is for the url_regex option used in WWW::Mechanize.

I am collecting all the links from a web page that start with http:// and have the following format:

http://google.com

and not,

http://google.com/dir/
http://google.com/dir/dir2/

So, I use the following:

@links=$mech->find_all_links(url_regex=>qr/^http:\/\/.*?\//)

And this still captures the URLs with sub paths in them.

I have tested my regex on regexpal.com and it works well there, but for some reason url_regex seems to expect a different syntax.

Thanks.


Solution

  • You should use:

    @links=$mech->find_all_links(url_regex=>qr/^http:\/\/[^\/]*\/?$/) 
    

    which reads:

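    The string has to start (^) with http://, followed by any number of characters (possibly none) other than a slash ([^\/]*), followed by an optional slash (\/?) at the end ($). The original pattern has no end anchor, so it only needs to match a prefix of the URL: .*?\/ stops at the first slash after the host name, and whatever follows is ignored. That is why URLs with sub paths still match.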
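
To see the difference without fetching a page, here is a minimal sketch that applies both patterns to a hard-coded list with grep. The @urls list is a made-up stand-in for the links on a real page, and the variable names are only illustrative. The patterns are the same as above, written with qr{} so the slashes need no escaping.

```perl
use strict;
use warnings;

# Hypothetical sample URLs standing in for links scraped from a page.
my @urls = (
    'http://google.com',
    'http://google.com/',
    'http://google.com/dir/',
    'http://google.com/dir/dir2/',
);

# Pattern from the question: no end anchor, so matching a prefix is enough.
my $unanchored = qr{^http://.*?/};

# Pattern from the answer: nothing but an optional slash may follow the host.
my $anchored = qr{^http://[^/]*/?$};

my @loose  = grep { $_ =~ $unanchored } @urls;
my @strict = grep { $_ =~ $anchored } @urls;

print "unanchored: @loose\n";
print "anchored:   @strict\n";
```

The unanchored pattern keeps every URL that has a slash after the host, sub paths included, and drops the bare http://google.com because it requires that slash. The anchored one keeps only the two host-only URLs. With WWW::Mechanize the compiled pattern can be passed directly, as in find_all_links(url_regex => $anchored). Note that url_regex is matched against the URL as it is written in the page, so if the page uses relative links, url_abs_regex (which matches the absolute URL) may be the better fit.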