Retrieve hyperlink value from <li> in HTML body tag?


I need to extract the href values inside the HTML body tag using a regular expression:

<html>
    <head>
  </head>
  <body class="directory">
    <input id="search" type="text" placeholder="Search" autocomplete="off" />
    <div id="wrapper">
      <h1><a href="/">~</a> / <a href="/public">public</a> / <a href="/public/img">img</a> / <a href="/public/img/events">events</a> / <a href="/public/img/events/poster">poster</a> / </h1>
      <ul id="files" class="view-tiles"><li><a href="/public/img/events" class="" title=".."><span class="name">..</span><span class="size"></span><span class="date"></span></a></li>
<li><a href="/public/img/events/poster/2018-09-26-1.PNG" class="" title="2018-09-26-1.PNG"><span class="name">2018-09-26-1.PNG</span><span class="size">1406471</span><span class="date">2018-9-16 18:37:23</span></a></li>
<li><a href="/public/img/events/poster/2018-09-26-2.PNG" class="" title="2018-09-26-2.PNG"><span class="name">2018-09-26-2.PNG</span><span class="size">530859</span><span class="date">2018-9-16 18:37:44</span></a></li>
<li><a href="/public/img/events/poster/2018-09-26-3.PNG" class="" title="2018-09-26-3.PNG"><span class="name">2018-09-26-3.PNG</span><span class="size">551409</span><span class="date">2018-9-16 18:38:24</span></a></li>
<li><a href="/public/img/events/poster/test" class="" title="test"><span class="name">test</span><span class="size">0</span><span class="date">2018-10-4 20:16:58</span></a></li></ul>
    </div>
  </body>
</html>

I want to have a list that contains

/public/img/events/poster/2018-09-26-1.PNG and 
/public/img/events/poster/2018-09-26-2.PNG and
/public/img/events/poster/2018-09-26-3.PNG.

The expression I used:

/[<body\sclass="directory">].+[<li><a\shref\s*=\s*\"]([^">]+)\"\s+[class].+[<\/body>]/g

However, I got this result:

<ul id="files" class="view-tiles"><li><a href="/public/img/events" class="" title=".."><span class="name">..</span><span class="size"></span><span class="date"></span></a></li>

<li><a href="/public/img/events/poster/2018-09-26-1.PNG" class="" title="2018-09-26-1.PNG"><span class="name">2018-09-26-1.PNG</span><span class="size">1406471</span><span class="date">2018-9-16 18:37:23</span></a></li>

<li><a href="/public/img/events/poster/2018-09-26-2.PNG" class="" title="2018-09-26-2.PNG"><span class="name">2018-09-26-2.PNG</span><span class="size">530859</span><span class="date">2018-9-16 18:37:44</span></a></li>

<li><a href="/public/img/events/poster/2018-09-26-3.PNG" class="" title="2018-09-26-3.PNG"><span class="name">2018-09-26-3.PNG</span><span class="size">551409</span><span class="date">2018-9-16 18:38:24</span></a></li>

<li><a href="/public/img/events/poster/test" class="" title="test"><span class="name">test</span><span class="size">0</span><span class="date">2018-10-4 20:16:58</span></a></li></ul>

Can someone guide me please?


Solution

  • You can use this regex:

    /<li[^>]*>[^<]*<a[^>]*href="([^"]+)"/g

    and then read the capturing group ([^"]+) via match[1], as follows (assuming you are using JavaScript):

        var myString = `<html>
        <head>
      </head>
      <body class="directory">
        <input id="search" type="text" placeholder="Search" autocomplete="off" />
        <div id="wrapper">
          <h1><a href="/">~</a> / <a href="/public">public</a> / <a href="/public/img">img</a> / <a href="/public/img/events">events</a> / <a href="/public/img/events/poster">poster</a> / </h1>
          <ul id="files" class="view-tiles"><li><a href="/public/img/events" class="" title=".."><span class="name">..</span><span class="size"></span><span class="date"></span></a></li>
    <li><a href="/public/img/events/poster/2018-09-26-1.PNG" class="" title="2018-09-26-1.PNG"><span class="name">2018-09-26-1.PNG</span><span class="size">1406471</span><span class="date">2018-9-16 18:37:23</span></a></li>
    <li><a href="/public/img/events/poster/2018-09-26-2.PNG" class="" title="2018-09-26-2.PNG"><span class="name">2018-09-26-2.PNG</span><span class="size">530859</span><span class="date">2018-9-16 18:37:44</span></a></li>
    <li><a href="/public/img/events/poster/2018-09-26-3.PNG" class="" title="2018-09-26-3.PNG"><span class="name">2018-09-26-3.PNG</span><span class="size">551409</span><span class="date">2018-9-16 18:38:24</span></a></li>
    <li><a href="/public/img/events/poster/test" class="" title="test"><span class="name">test</span><span class="size">0</span><span class="date">2018-10-4 20:16:58</span></a></li></ul>
        </div>
      </body>
    </html>`;
    
    // The g flag makes each exec() call resume after the previous match.
    var myRegexp = /<li[^>]*>[^<]*<a[^>]*href="([^"]+)"/g;
    var match = myRegexp.exec(myString);
    while (match != null) {
      // matched text: match[0]
      // match start: match.index
      // capturing group n: match[n]
      console.log(match[1]); // the href value
      match = myRegexp.exec(myString);
    }

    Credits to this answer for the code example.
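
The loop above logs every href found inside an <li>, including the parent-directory link and the test entry, while the question asks for a list with only the three PNG paths. A minimal sketch of one way to get that list is shown below. It assumes a runtime with String.prototype.matchAll (Node.js 12 or later) and assumes that "wanted" means "ends in .png, case-insensitive". The html sample is a shortened, hypothetical version of the listing, and the names hrefRegex, allHrefs and pngHrefs are illustrative only.

```javascript
// Hypothetical sample: a trimmed-down version of the listing from the question.
const html = `<ul id="files" class="view-tiles"><li><a href="/public/img/events" class="" title="..">..</a></li>
<li><a href="/public/img/events/poster/2018-09-26-1.PNG" class="" title="2018-09-26-1.PNG">2018-09-26-1.PNG</a></li>
<li><a href="/public/img/events/poster/2018-09-26-2.PNG" class="" title="2018-09-26-2.PNG">2018-09-26-2.PNG</a></li>
<li><a href="/public/img/events/poster/2018-09-26-3.PNG" class="" title="2018-09-26-3.PNG">2018-09-26-3.PNG</a></li>
<li><a href="/public/img/events/poster/test" class="" title="test">test</a></li></ul>`;

// Same pattern as in the answer; matchAll requires the g flag.
const hrefRegex = /<li[^>]*>[^<]*<a[^>]*href="([^"]+)"/g;

// Collect capturing group 1 of every match into an array.
const allHrefs = Array.from(html.matchAll(hrefRegex), (m) => m[1]);

// Assumption: only entries ending in ".png" (any letter case) are wanted.
const pngHrefs = allHrefs.filter((href) => /\.png$/i.test(href));

console.log(pngHrefs);
```

With this sample, allHrefs holds all five links and pngHrefs holds the three PNG paths. Filtering after the match keeps the regex itself simple. A different rule (for example, excluding only the ".." entry) would just change the filter callback.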


    Update 1

    The author asked how to restrict the match to the body tag:

    Just curious. How do I update the expression if I want to limit the matching range to the <body> tag? I updated the expression as below, but got no result. ]>.]>[^<]]href="([^"]+)".</body[^>]*>

    There is only so much you can do with a regex, and in general it is not recommended to do advanced HTML parsing with regexes. Your approach runs into problems with the line breaks and with the fact that you want to match multiple <li>s inside a single body. Also, by HTML convention, <li>s are only allowed in the body, so for valid HTML the restriction should not change the result.
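
To make the line-break problem concrete, here is a small sketch. It also illustrates a second pitfall that appears in the expression from the question: square brackets define a character class, so [class] matches a single character rather than the word "class". The sample string and the variable names are made up for illustration.

```javascript
// Hypothetical three-line document.
const sample = '<body>\n<li><a href="/a.png">a</a></li>\n</body>';

// "." does not match line breaks, so ".*" cannot span the lines.
const dotPattern = /<body>.*<\/body>/;
// [\s\S] matches any character, including line breaks.
const anyCharPattern = /<body>[\s\S]*<\/body>/;

console.log(dotPattern.test(sample));     // false
console.log(anyCharPattern.test(sample)); // true

// [class] is a character class: it matches ONE of the characters c, l, a, s.
const classPattern = /^[class]$/;
console.log(classPattern.test('c'));      // true
console.log(classPattern.test('class'));  // false
```

In newer JavaScript engines the s (dotAll) flag is an alternative to [\s\S], since it lets "." match line breaks as well.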

    If you still want to restrict the search to the body, break it down into two steps: first match the body, then match the hrefs inside it:

        var myString = `<html>
        <head>
        <!-- Not valid HTML, just for testing -->
        <ul id="files" class="view-tiles"><li><a href="/public/img/events" class="" title=".."><span class="name">..</span><span class="size"></span><span class="date"></span></a></li>
        <li><a href="/public/img/events/poster/2018-09-26-1.PNG" class="" title="2018-09-26-1.PNG"><span class="name">2018-09-26-1.PNG</span><span class="size">1406471</span><span class="date">2018-9-16 18:37:23</span></a></li>
        <li><a href="/public/img/events/poster/2018-09-26-2.PNG" class="" title="2018-09-26-2.PNG"><span class="name">2018-09-26-2.PNG</span><span class="size">530859</span><span class="date">2018-9-16 18:37:44</span></a></li>
        <li><a href="/public/img/events/poster/2018-09-26-3.PNG" class="" title="2018-09-26-3.PNG"><span class="name">2018-09-26-3.PNG</span><span class="size">551409</span><span class="date">2018-9-16 18:38:24</span></a></li>
        <li><a href="/public/img/events/poster/test" class="" title="test"><span class="name">test</span><span class="size">0</span><span class="date">2018-10-4 20:16:58</span></a></li></ul>
      </head>
      <body class="directory">
        <input id="search" type="text" placeholder="Search" autocomplete="off" />
        <div id="wrapper">
          <h1><a href="/">~</a> / <a href="/public">public</a> / <a href="/public/img">img</a> / <a href="/public/img/events">events</a> / <a href="/public/img/events/poster">poster</a> / </h1>
          <ul id="files" class="view-tiles"><li><a href="/public/img/events" class="" title=".."><span class="name">..</span><span class="size"></span><span class="date"></span></a></li>
    <li><a href="/public/img/events/poster/2018-09-26-1.PNG" class="" title="2018-09-26-1.PNG"><span class="name">2018-09-26-1.PNG</span><span class="size">1406471</span><span class="date">2018-9-16 18:37:23</span></a></li>
    <li><a href="/public/img/events/poster/2018-09-26-2.PNG" class="" title="2018-09-26-2.PNG"><span class="name">2018-09-26-2.PNG</span><span class="size">530859</span><span class="date">2018-9-16 18:37:44</span></a></li>
    <li><a href="/public/img/events/poster/2018-09-26-3.PNG" class="" title="2018-09-26-3.PNG"><span class="name">2018-09-26-3.PNG</span><span class="size">551409</span><span class="date">2018-9-16 18:38:24</span></a></li>
    <li><a href="/public/img/events/poster/test" class="" title="test"><span class="name">test</span><span class="size">0</span><span class="date">2018-10-4 20:16:58</span></a></li></ul>
        </div>
      </body>
    </html>`;
    
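    // Step 1: capture everything between <body ...> and </body>.
    // [\s\S]* is used instead of .* so the match can span line breaks.
    var bodyRegex = /<\s*body[^>]*>([\s\S]*)<\s*\/body>/;
    var bodyMatch = bodyRegex.exec(myString);
    // Fall back to an empty string if the document has no body.
    var bodyString = bodyMatch ? bodyMatch[1] : '';
    
    // Step 2: run the href regex against the body content only.
    var myRegexp = /<li[^>]*>[^<]*<a[^>]*href="([^"]+)"/g;
    var match = myRegexp.exec(bodyString);
    while (match != null) {
      // matched text: match[0]
      // match start: match.index
      // capturing group n: match[n]
      console.log(match[1]); // the href value
      match = myRegexp.exec(bodyString);
    }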