Find all the URL links in an HTML text using regex. The text below is assigned to the html variable.
html = """
<a href="#fragment-only">anchor link</a>
<a id="some-id" href="/relative/path#fragment">relative link</a>
<a href="//other.host/same-protocol">same-protocol link</a>
<a href="https://example.com">absolute URL</a>
"""
The output should look like this:
["/relative/path","//other.host/same-protocol","https://example.com"]
The function should ignore fragment-only links (link targets that begin with #). Also, if a URL points to a specific fragment/section using the hash symbol, the fragment part (everything from the # onward) should be stripped before the URL is returned by the function.
I have tried the code below, but it does not work; it only outputs ["https://example.com"]:
import re
urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', html)
print(urls)
You could try using a positive lookbehind to find the quoted strings that follow href=" in the HTML:
import re
pattern = re.compile(r'(?<=href=\")(?!#)(.+?)(?=#|\")')
urls = pattern.findall(html)
print(urls)
See this answer for more on how matching only up to the '#' character works, and here if you want a breakdown of the RegEx overall
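For reference, here is a rough sketch of the same pattern written with re.VERBOSE so each piece is commented, and wrapped in a function. The names HREF_PATTERN, extract_links and sample_html are only illustrative and not from the question. It assumes every href value is wrapped in double quotes, as in the sample, and it is not a general HTML parser.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE.
# In verbose mode a bare '#' starts a comment, so literal hashes are escaped.
HREF_PATTERN = re.compile(
    r"""
    (?<=href=")   # lookbehind: the match must directly follow href="
    (?!\#)        # negative lookahead: skip values that begin with a hash
    (.+?)         # lazily capture the link target
    (?=\#|")      # stop just before the first hash or the closing quote
    """,
    re.VERBOSE,
)


def extract_links(html):
    """Return href targets with fragments stripped, skipping fragment-only links."""
    return HREF_PATTERN.findall(html)


sample_html = """
<a href="#fragment-only">anchor link</a>
<a id="some-id" href="/relative/path#fragment">relative link</a>
<a href="//other.host/same-protocol">same-protocol link</a>
<a href="https://example.com">absolute URL</a>
"""

print(extract_links(sample_html))
# ['/relative/path', '//other.host/same-protocol', 'https://example.com']
```

The lazy (.+?) together with the (?=\#|") lookahead is what cuts the match off at the fragment, and the (?!\#) lookahead is what drops the fragment-only link entirely.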