Tags: html, regex, pcre

RegEx matching for HTML and non-HTML URLs


I'm trying to extract all URLs from this text, both absolute and relative, but I can't get the regular expression right. My expression matches more than I would like: it also captures HTML tags and other information that I do not want.

Attempt

(\w*.)(\\\/){1,}(.*)(?![^"])

Input

<div class=\"loader\">\n       <div class=\"loaderImage\"><img src=\"\/c\/Community\/Rating\/img\/loader.gif\" \/><\/div>\n    <\/div>\n<\/div>\n<\/div><\/span><\/span>\n
   <a title=\"Avengers\" href=\"\/pt\/movie\/Avengers\/57689\" >Avengers<\/a>                                                                                                                        <\/div>\n         
<img title=\"\" alt=\"\" id=\"145793\" src=\"https:\/\/images04-cdn.google.com\/movies\/74932\/74932_02\/previews\/2\/128\/top_1_307x224\/74932_02_01.jpg\" class=\"tlcImageItem img\"  width=\"307\"   height=\"224\"  \/>
pageLink":"\/pt\/videos\/\/updates\/1\/0\/Category\/0","previousPage":"\/pt\/videos\/\/updates\/1\/0\/Category\/0","nextUrl":"\/pt\/videos\/\/updates\/2\/0\/Category\/0","method":"updates","type":"scenes","callbackJs"
<span class=\"value\">4<\/span>\n        <\/div>\n          <\/div>\n    <div class=\"loader\">\n       <div class=\"loaderImage\"><img src=\"\/c\/Community\/Rating\/img\/loader.gif\" \/><\/div>\n    <\/div>\n<\/div>\n<\/div><\/span><\/span>


Solution

  • As noted in the comments, a regular expression may not be the best tool for this problem. However, if you wish to practice, or you really have to use one, you can match exactly between the double quotes where your URLs appear. You can anchor each match on the left with src, href, or any other fixed component you have: list the alternatives, separated by |, in the first group ().

    RegEx 1 for HTML URLs

    This RegEx may not be the right solution, but it might give you an idea of how to approach the problem with RegEx:

    (src=|href=)(\\")([a-zA-Z\\\/0-9\.\:_-]+)(")
    

    It creates four groups to make it easier to update, and group $3 should hold your desired URLs. You can add any other characters that your URLs might contain to the character class in the third group.
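
As a rough illustration, here is a minimal Python sketch that applies RegEx 1 to two fragments copied from the question's input. The names `REGEX_1` and `unescape` are made up for this example. Note that the character class contains the backslash, so group 3 comes back still escaped (`\/` instead of `/`) and with the backslash of the closing `\"` attached; the sketch assumes you want both cleaned up.

```python
import re

# Two fragments copied from the question's input. The text is JSON-escaped
# HTML, so quotes appear as \" and slashes as \/.
text = (
    r'<div class=\"loaderImage\"><img src=\"\/c\/Community\/Rating\/img\/loader.gif\" \/><\/div>'
    r'<a title=\"Avengers\" href=\"\/pt\/movie\/Avengers\/57689\" >Avengers<\/a>'
)

# RegEx 1 from the answer, unchanged.
REGEX_1 = re.compile(r'(src=|href=)(\\")([a-zA-Z\\\/0-9\.\:_-]+)(")')


def unescape(raw):
    # Hypothetical helper: drop the trailing backslash that belongs to the
    # closing \" and turn every escaped slash back into a plain slash.
    return raw.rstrip("\\").replace("\\/", "/")


urls = [unescape(m.group(3)) for m in REGEX_1.finditer(text)]
print(urls)
# ['/c/Community/Rating/img/loader.gif', '/pt/movie/Avengers/57689']
```

The title and class attributes are skipped because only src= and href= are listed in the first group.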

    RegEx 2 for both HTML and non-HTML URLs

    For capturing other non-HTML URLs, you can update it similar to this RegEx:

    (src=\\|href=\\|pageLink\x22:|previousPage\x22:|nextUrl\x22:)(")([a-zA-Z\\\/0-9\.\:_-]+)(") 
    

    where \x22 stands for ", which you can simply replace with a literal ". I only used \x22 so that you can see those " characters, between which your target URLs are located:

    The second RegEx also has four groups, where the target group is $3. You can also simplify it or remove the repetition (DRY), if you wish.
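
One possible way to DRY the second RegEx, sketched in Python: keep the left-hand anchors in a list and join them with |, so that a new key only needs one extra list entry. The names `PREFIXES`, `REGEX_2` and `unescape` are assumptions made for this example; the joined pattern is meant to be equivalent to RegEx 2 above, and the sample text is copied from the question's input.

```python
import re

# One HTML fragment and one JSON fragment copied from the question's input.
text = (
    r'<img title=\"\" alt=\"\" id=\"145793\" src=\"https:\/\/images04-cdn.google.com\/movies\/74932\/74932_02\/previews\/2\/128\/top_1_307x224\/74932_02_01.jpg\" class=\"tlcImageItem img\" \/>'
    r'pageLink":"\/pt\/videos\/\/updates\/1\/0\/Category\/0","previousPage":"\/pt\/videos\/\/updates\/1\/0\/Category\/0","nextUrl":"\/pt\/videos\/\/updates\/2\/0\/Category\/0"'
)

# Left-hand anchors of RegEx 2, one per list entry (hypothetical layout).
PREFIXES = [
    r'src=\\',
    r'href=\\',
    r'pageLink\x22:',
    r'previousPage\x22:',
    r'nextUrl\x22:',
]

# Joining the list rebuilds the first group of RegEx 2; the rest is unchanged.
REGEX_2 = re.compile(
    '(' + '|'.join(PREFIXES) + r')(")([a-zA-Z\\\/0-9\.\:_-]+)(")'
)


def unescape(raw):
    # Hypothetical helper: drop the trailing backslash left by a closing \"
    # (HTML attributes only) and turn every \/ back into /.
    return raw.rstrip("\\").replace("\\/", "/")


urls = [unescape(m.group(3)) for m in REGEX_2.finditer(text)]
for url in urls:
    print(url)
```

On this sample the sketch should print the absolute image URL followed by the three relative paths from the JSON keys; the double slash in /pt/videos//updates/... is kept as it appears in the input.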