Tags: node.js, regex, regex-lookarounds, regex-group, regex-greedy

Regex to match a list of alphanumeric ids of a JSON fragment inside HTML


I'm trying to compose a regular expression to match the following situation:

In a Node.js project I have a multiline string that contains a large chunk of HTML mixed with some JS, with this structure:

<html>
  <head>
  </head>
  <body>
    <script type="text/javascript">
      ... more code ...
      },
      "bookIds" : [
        "abc123",
        "qwe456",
        "asd789"
      ],
      ... more code, and in another json:
      },
      "bookIds" : [
        "foo111",
        "bar222",
        "baz333"
      ],
      ... more code ...
    </script>
  </body>
</html>

My goal is to get the first list of bookIds:

abc123
qwe456
asd789

So, as you can see, the conditions that I'm working with, for now, are:

  • Find the first occurrence of "bookIds" : [ and stop at the next ]

I got something like that with: /bookIds" : \[([\S\s]*?)\]/. Conceptually, I thought about finding the first bookIds string, starting after the first [ that follows it, and stopping before the next ], but I don't know how to do it. I'm now reading up on lookaheads and lookbehinds.

  • Now I need to search (or loop) inside that match and get what's inside the quotes (I know how to do that for a single value: /"(.*?)"/)

But unfortunately I've spent hours googling and experimenting, and I can't get it to work (neither in my Node project nor in the tests I'm trying on regex101.com).

Any suggestions will be much appreciated!


Solution

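  • You can use /"bookIds"\s*:\s*\[([^\]]+?)]/. It allows any whitespace around the colon and captures everything up to the next ]. Without the g flag, match() returns only the first occurrence.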

    let str = `<html>
      <head>
      </head>
      <body>
        <script type="text/javascript">
          "bookIds" : [
            "abc123",
            "qwe456",
            "asd789"
          ],
          "bookIds" : [
            "foo111",
            "bar222",
            "baz333"
          ],
        <\/script>
      <\/body>
    <\/html>`
    
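    // No g flag, so match() returns only the first "bookIds" block; group 1 holds its contents.
    // (The m flag is not needed here: the pattern has no ^ or $ anchors.)
    let op = str.match(/"bookIds"\s*:\s*\[([^\]]+?)]/)
    // Strip whitespace and quotes, leaving a comma-separated string of ids.
    console.log(op[1].replace(/[\s"]+/g,'')) // abc123,qwe456,asd789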
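
If you need the ids as an array rather than one comma-separated string, the two steps from the question (find the first block, then pull out each quoted value) can be sketched roughly as below. This is only a minimal sketch: `firstBookIds` and `sample` are hypothetical names made up for illustration, and it assumes the ids never contain `]` or escaped quotes.

```javascript
// Hypothetical helper (not part of the original answer): returns the ids of the
// first "bookIds" array found in `html`, or an empty array if there is none.
function firstBookIds(html) {
  // Without the g flag, match() stops at the first occurrence.
  // [^\]]* captures everything up to the next closing bracket.
  const block = html.match(/"bookIds"\s*:\s*\[([^\]]*)\]/);
  if (block === null) return [];
  // Collect every double-quoted value inside the captured list.
  return Array.from(block[1].matchAll(/"([^"]*)"/g), (m) => m[1]);
}

// Assumed sample input, shaped like the fragment in the question.
const sample = `
  "bookIds" : [
    "abc123",
    "qwe456",
    "asd789"
  ],
  "other" : 1,
  "bookIds" : [
    "foo111"
  ],
`;

console.log(firstBookIds(sample)); // [ 'abc123', 'qwe456', 'asd789' ]
console.log(firstBookIds("<html></html>")); // []
```

Returning an empty array when nothing matches avoids the TypeError that `op[1]` would throw on a null match. String.prototype.matchAll needs Node.js 12 or later.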