javascript regex html-parsing

How to parse an html string based on certain delimiters?


#202020#<font face="Helvetica">this is string entered by a # user #202021# </font><b style=""><font face="Helvetica Neue" style="">#<u>001</u>10#&nbsp;</font></b>

Expected result: #202020#<font face="Helvetica">this is string entered by a # user #202021# </font><b style=""><font face="Helvetica Neue" style="">#00110#<u></u>&nbsp;</font></b>

Given an HTML string like the one above, I want to be able to rearrange the characters delimited by '#'s, together with the 5-digit numbers between them, so that each '#' marker ends up as one contiguous chunk.

Right now I am trying to use a regex to strip the HTML and then split on '#', but that doesn't work because the text can contain a '#' that isn't part of my '#' markdown. I also don't know how to recombine the resulting arrays of string chunks, HTML tags, and '#' number chunks. It doesn't matter if I strip or move out the styling on the '#12345#' part of the string, as long as those characters are grouped; the styling tags can be moved or wrapped around the hash markdown arbitrarily.

The reason for this is that I have a WYSIWYG component that is required to store this '#12345#'-formatted markdown, which the server converts to a URL using a lookup table. On save, I want to be able to format the '#' markdown. The WYSIWYG editor I'm using is react-summernote.


Solution

  • You can do this using String.prototype.replace with a regex and a callback. The regex

    /#([^#]*\d)#/g
    

    matches a pair of # characters enclosing any run of non-# characters that ends in a digit.

    In the callback, remove all non-digit characters and count the remaining digits. If there are exactly five, return the digits enclosed in #; otherwise return the original match unchanged.

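    You can also use a positive lookbehind and lookahead so that the # characters are not part of the match, in which case you don't need to add the hashes back when replacing the tag. Lookbehind assertions were added in ES2018, so this form needs an engine that supports them: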

    /(?<=#)([^#]*\d)(?=#)/g
    

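    // Collapse a '#...#' tag to its digits when it contains exactly five of them.
    const func = str => str.replace(/(?<=#)([^#]*\d)(?=#)/g, (_match, tag) => {
      const numbers = tag.replace(/\D/g, ''); // remove non-digits
      if (numbers.length === 5) {
        return numbers; // return only the digits; the hashes were never part of the match
      }
      return tag; // not a five-digit tag: leave it untouched
    });
    
    console.log(func('#12<b>345</b>6#')); // six digits:  '#12<b>345</b>6#' (unchanged)
    console.log(func('#1<b>2</b>34#'));   // four digits: '#1<b>2</b>34#' (unchanged)
    console.log(func('#12345#'));         // '#12345#'
    console.log(func('#1<b>234</b>5#'));  // '#12345#'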
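
The first form of the regex, where the hashes are part of the match and the callback puts them back, can be sketched roughly as below. This is only an illustration, not the answerer's code: the name normalizeHashTags is hypothetical, and stripping anything that looks like an HTML tag before counting digits is an added assumption, made so that digits inside attributes (for example size="1") are not counted as part of the number.

```javascript
// Hypothetical helper; the name and the tag-stripping step are assumptions,
// not part of the original answer.
function normalizeHashTags(html) {
  // Consuming form: both '#' characters belong to the match,
  // so the replacement has to add them back.
  return html.replace(/#([^#]*\d)#/g, (match, inner) => {
    // Assumption: drop anything shaped like an HTML tag first, so that
    // digits inside attributes are not counted.
    const text = inner.replace(/<[^>]*>/g, '');
    const digits = text.replace(/\D/g, ''); // keep digits only
    // Exactly five digits: emit the clean tag. Otherwise leave the match alone.
    return digits.length === 5 ? `#${digits}#` : match;
  });
}

// The string from the question.
const sample =
  '#202020#<font face="Helvetica">this is string entered by a # user #202021# </font>' +
  '<b style=""><font face="Helvetica Neue" style="">#<u>001</u>10#&nbsp;</font></b>';

console.log(normalizeHashTags(sample));
console.log(normalizeHashTags('#1<b>234</b>5#'));        // '#12345#'
console.log(normalizeHashTags('#<font size="1">2345#')); // unchanged: only four digits in the text
```

On the string from the question, only the #<u>001</u>10# chunk changes (to #00110#). The six-digit chunks and the stray '#' in the sentence are left as they are, and the <u> styling is dropped rather than moved, which the question says is acceptable.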