javascript regex quoting

Regular expression for translating quoting syntax to HTML


I have a quote syntax for my users, similar to SO:

So, Mike, you say:

>Lorem ipsum dolor sit amet, consectetur adipiscing elit.
>Nam mi dui, porta non gravida id
>sodales venenatis tellus

But this makes no sense!

There can also be multiple quotes. I need to translate this into HTML markup like this, using JavaScript:

So, Mike, you say:

<div class="quote">
    Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    Nam mi dui, porta non gravida id
    sodales venenatis tellus
</div>

But this makes no sense!

Here is the best I have come up with, but it wraps every single line in the markup instead of wrapping the block of lines as a whole.

x = x.replace(/^&(amp;)?gt;([^\n]+)$/mg, "<div class=\"quote\"> $2 </div>");

Is it possible to write such a regular expression? If yes, what would it look like?


Solution

  • This is trying to manipulate HTML with a regular expression. (I say that based on the fact you're searching for HTML entities for > rather than literally for >.) That is almost always a bad idea, nearly as bad as trying to parse HTML with just a regular expression. Obligatory Link.

    You cannot do this with a single replace call. But to my surprise, you can do it with two, or with a single outer call that uses a function callback to make a bunch of inner calls.

    Here's the two-call version:

    x = x.replace(/^(?:(?:>|&gt;|&amp;gt;).*?[\r\n]+)+/gm, '<div class="quote">***$&</div>');
    x = x.replace(/(?:\*\*\*|^)(?:>|&gt;|&amp;gt;)/gm, '');
    

    The first finds each run of consecutive lines that start with a quote marker and wraps the div markup around it. (I included a raw > in the set, so it looks for >, &gt;, and &amp;gt; [see note below about that last one, though].) The second removes the quote markers. You can't remove them in the first replace because you're replacing the entire group of lines. Also note that I had to prefix the first marker with ***, since once we've added the div markup, that marker is no longer at the beginning of a line.
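To make the two steps concrete, here is a small trace that keeps the intermediate string instead of overwriting `x`. The sample input and the names `input`, `step1`, and `step2` are made up for illustration; the two regular expressions are the ones from the answer.

```javascript
// Hypothetical sample input; only the regular expressions come from the answer.
var input =
  "So, Mike, you say:\n" +
  "\n" +
  ">Lorem ipsum\n" +
  ">dolor sit amet\n" +
  "\n" +
  "But this makes no sense!\n";

// Step 1: wrap each run of quoted lines in the div markup.
// The *** tags the first marker, which is no longer at the start of a line.
var step1 = input.replace(
  /^(?:(?:>|&gt;|&amp;gt;).*?[\r\n]+)+/gm,
  '<div class="quote">***$&</div>'
);

// Step 2: remove "***" + marker on the first quoted line,
// and the line-start marker on every other quoted line.
var step2 = step1.replace(/(?:\*\*\*|^)(?:>|&gt;|&amp;gt;)/gm, '');

// JSON.stringify makes the newlines visible in the log.
console.log(JSON.stringify(step1));
console.log(JSON.stringify(step2));
```

One thing the trace makes visible: because `[\r\n]+` also consumes the blank line after the quote, the closing `</div>` ends up directly in front of the paragraph that follows.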

    Here's the one-call-with-subcalls version:

    x = x.replace(/^(?:(?:>|&gt;|&amp;gt;).*?[\r\n]+)+/gm, function(m) {
      return '<div class="quote">' + m.replace(/^(?:>|&gt;|&amp;gt;)/gm, '') + '</div>';
    });
    

    How robust are they? Probably not very; see the first paragraph of this answer. :-)
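Two concrete limits of the patterns above are worth knowing. A quoted line at the very end of the input with no trailing newline is not matched at all, and two quote blocks separated only by a blank line are merged into one div. The sketch below shows both, followed by one possible variant that behaves differently. The name `wrapQuotes` and its pattern are assumptions for illustration, not part of the original answer, and whether merging across blank lines is wanted depends on your requirements.

```javascript
// The one-call-with-subcalls version from the answer, unchanged.
function processString(x) {
  return x.replace(/^(?:(?:>|&gt;|&amp;gt;).*?[\r\n]+)+/gm, function (m) {
    return '<div class="quote">' + m.replace(/^(?:>|&gt;|&amp;gt;)/gm, '') + '</div>';
  });
}

// Hypothetical variant: each quoted line ends at exactly one line break
// or at the end of the input, so blank lines are never swallowed.
function wrapQuotes(x) {
  return x.replace(/^(?:(?:>|&gt;|&amp;gt;).*(?:\r?\n|$))+/gm, function (block) {
    return '<div class="quote">' + block.replace(/^(?:>|&gt;|&amp;gt;)/gm, '') + '</div>';
  });
}

// Case 1: quoted line without a trailing newline.
console.log(JSON.stringify(processString(">last line")));
console.log(JSON.stringify(wrapQuotes(">last line")));

// Case 2: two quote blocks separated only by a blank line.
console.log(JSON.stringify(processString(">a\n\n>b\n")));
console.log(JSON.stringify(wrapQuotes(">a\n\n>b\n")));
```

The variant still manipulates escaped HTML with a regular expression, so the caveat from the first paragraph applies to it as well.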

    Side note: Your regex seems to be looking for &amp;gt; as a quote marker. I've preserved that above, but if you have &amp;gt; in your HTML, you have double-encoded HTML, which is usually an indicator of a problem elsewhere.
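As an illustration of how `&amp;gt;` typically arises, here is a minimal sketch in which the same text is escaped twice. The `escapeHtml` helper is a hypothetical, deliberately simple escaper written for this example; it is not taken from the question or from any particular library.

```javascript
// Hypothetical minimal escaper, for illustration only.
function escapeHtml(s) {
  return s
    .replace(/&/g, '&amp;')  // must run first so later entities are not re-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

var raw = '>quoted line';
var once = escapeHtml(raw);    // the marker becomes an entity
var twice = escapeHtml(once);  // the entity's own & gets escaped again

console.log(once);
console.log(twice);
```

If the second form shows up in your data, it is usually better to find and remove the extra escaping step than to teach the regular expression to tolerate it.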

    Live Example of the two-call version, with the different quote markers appearing in varying order:

    function test(x) {
      snippet.log("Before:");
      snippet.log(x);
      x = processString(x);
      snippet.log("After:");
      snippet.log(x);
      document.body.appendChild(document.createElement('hr'));
    }
    
    function processString(x) {
      x = x.replace(/^(?:(?:>|&gt;|&amp;gt;).*?[\r\n]+)+/gm, '<div class="quote">***$&</div>');
      x = x.replace(/(?:\*\*\*|^)(?:>|&gt;|&amp;gt;)/gm, '');
      return x;
    }
    
    test(
      "So, Mike, you say:\n" +
      "\n" +
      ">Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n" +
      "&amp;gt;Nam mi dui, porta non gravida id\n" +
      "&gt;sodales venenatis tellus\n" +
      "\n" +
      "But this makes no sense!\n"
    );
    test(
      "So, Mike, you say:\n" +
      "\n" +
      "&amp;gt;Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n" +
      ">Nam mi dui, porta non gravida id\n" +
      "&gt;sodales venenatis tellus\n" +
      "\n" +
      "But this makes no sense!\n"
    );
    test(
      "So, Mike, you say:\n" +
      "\n" +
      "&gt;Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n" +
      "&amp;gt;Nam mi dui, porta non gravida id\n" +
      ">sodales venenatis tellus\n" +
      "\n" +
      "But this makes no sense!\n"
    );
    <!-- Script provides the `snippet` object, see http://meta.stackexchange.com/a/242144/134069 -->
    <script src="http://tjcrowder.github.io/simple-snippets-console/snippet.js"></script>

    Live Example of the one-call-with-subcalls version:

    function test(x) {
      snippet.log("Before:");
      snippet.log(x);
      x = processString(x);
      snippet.log("After:");
      snippet.log(x);
      document.body.appendChild(document.createElement('hr'));
    }
    
    function processString(x) {
      x = x.replace(/^(?:(?:>|&gt;|&amp;gt;).*?[\r\n]+)+/gm, function(m) {
        return '<div class="quote">' + m.replace(/^(?:>|&gt;|&amp;gt;)/gm, '') + '</div>';
      });
      return x;
    }
    
    test(
      "So, Mike, you say:\n" +
      "\n" +
      ">Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n" +
      "&amp;gt;Nam mi dui, porta non gravida id\n" +
      "&gt;sodales venenatis tellus\n" +
      "\n" +
      "But this makes no sense!\n"
    );
    test(
      "So, Mike, you say:\n" +
      "\n" +
      "&amp;gt;Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n" +
      ">Nam mi dui, porta non gravida id\n" +
      "&gt;sodales venenatis tellus\n" +
      "\n" +
      "But this makes no sense!\n"
    );
    test(
      "So, Mike, you say:\n" +
      "\n" +
      "&gt;Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n" +
      "&amp;gt;Nam mi dui, porta non gravida id\n" +
      ">sodales venenatis tellus\n" +
      "\n" +
      "But this makes no sense!\n"
    );
    <!-- Script provides the `snippet` object, see http://meta.stackexchange.com/a/242144/134069 -->
    <script src="http://tjcrowder.github.io/simple-snippets-console/snippet.js"></script>