Search code examples
pythonregexpython-re

finding href path value Python regex


Find all the url links in a html text using regex Arguments. below text assigned to html vaiable.

html = """
<a href="#fragment-only">anchor link</a>
<a id="some-id" href="/relative/path#fragment">relative link</a>
<a href="//other.host/same-protocol">same-protocol link</a>
<a href="https://example.com">absolute URL</a>
"""

output should be like that:

["/relative/path","//other.host/same-protocol","https://example.com"]

The function should ignore fragment identifiers (link targets that begin with #). I.e., if the url points to a specific fragment/section using the hash symbol, the fragment part (the part starting with #) of the url should be stripped before it is returned by the function

//I have tried this bellow one but not working its only give output: ["https://example.com"]

 urls = re.findall('https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', html)
 print(urls)

Solution

  • You could try using positive lookbehind to find the quoted strings in front of href= in html

    pattern = re.compile(r'(?<=href=\")(?!#)(.+?)(?=#|\")')
    urls = re.findall(pattern, html)
    

    See this answer for more on how matching only up to the '#' character works, and here if you want a breakdown of the RegEx overall