Tags: node.js, regex, geturl

Retrieve relative URLs from text


I have a string of HTML with both absolute and relative URLs, and I'm trying to retrieve only the relative URLs. I tried using the get-urls package, but it only retrieves absolute URLs.

An example of the HTML string received:

<!DOCTYPE>
<html>
<head>

<title>Our first HTML page</title>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

</head>
<body>

<h2>Welcome to the web site: this is a heading inside of the heading tags.</h2>

<p>This is a paragraph of text inside the paragraph HTML tags. We can just keep writing ...
</p>

<h3>Now we have an image:</h3>

<div><img src="/images/plantTracing.gif" alt="Graphic of a Mouse Pad"></div>

<h3>
This is another heading inside of another set of headings tags; this time the tag is an 'h3' instead of an 'h2' , that means it is a less important heading.
</h3>

<h4>Yet another heading - right after this we have an HTML list:</h4>

<ol>
<li><a href="https://github.com/">First item in the list</a></li>
<li><a href="/modules/example.md"> Second item in the list</a></li>
<li>Third item in the list</li>
</ol>

<p>You will notice in the above HTML list, the HTML automatically creates the numbers in the list.</p>

<h3>About the list tags</h3>
</body>
</html>

This is what I'm currently doing:

getUrls(html) // html is the string of HTML received

It only returns {https://github.com/}.

I want it to return {https://github.com/, /modules/example.md}.


Solution

  • The get-urls package requires the URL to either start with a scheme such as http:// or to start with a known top-level domain.

    In fact, the documentation even says: "Require URLs to have a scheme or leading www. to be considered an URL."

    Since you're looking for relative paths that have neither of those, that package will not do what you want.

    You will probably be better served by an actual HTML parser such as cheerio, which finds URLs in HTML attributes based on HTML context rather than on text-matching tricks. That approach will find all the paths that are relative URLs.