
Is there a standard approach for parsing inline modifiers in Textile/Markdown?


I've been looking at writing a Textile parser using Scala's parser combinator library (basically a PEG parser), and was wondering what kind of approach I should use for parsing the inline modifiers:

This is *bold* text, _italic_ text, +underlined+ text, etc.

In this case it's pretty clear what's what, and what should be parsed. However, there are a large number of edge cases where it's not so clear. Focusing only on bold text:

Which sections get bolded: 
*onomato*poeia* ?
bold *word*, without a space after?
tyr*annos*aurus
a bold word in a (*bracket*)?
How about *This *case?

Obviously this is a mix of subjective (which things should count as bold) and objective (how to make the parsing rules parse it correctly).

I'm leaning towards a PEG, something like:

wordChar = [a-zA-Z]
nonWordChar = [^a-zA-Z]
boldStart = nonWordChar ~ "*" ~ wordChar
boldEnd = wordChar ~ "*" ~ nonWordChar
boldSection = boldStart ~ rep(not(boldEnd) ~ anyChar) ~ boldEnd

Which would parse the above as follows:

<b>onomato*poeia</b> ?
bold <b>word</b>, without a space after?
tyr*annos*aurus    <- fails because of lack of whitespace
a bold word in a (<b>bracket</b>)?
How about *This *case? <- fails because there is no correct closing *
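
One quick way to check the grammar above against the five examples is to approximate it with a regular expression. This is only a minimal sketch, written in Python rather than Scala, and it assumes that the start and end of the input count as non-word characters. The names BOLD and render_bold are made up for illustration. Unlike the PEG as written, it also accepts a one-letter body such as *a*.

```python
import re

# Approximation of boldStart ~ body ~ boldEnd using lookarounds, so the
# neighbouring characters are inspected but not consumed.
BOLD = re.compile(
    r"(?<![A-Za-z])\*(?=[A-Za-z])"  # opening *: no letter before, a letter after
    r"(.*?)"                        # shortest body that reaches a valid closing *
    r"(?<=[A-Za-z])\*(?![A-Za-z])"  # closing *: a letter before, no letter after
)


def render_bold(text: str) -> str:
    """Wrap every span matched by BOLD in <b>...</b>; leave the rest unchanged."""
    return BOLD.sub(r"<b>\1</b>", text)


if __name__ == "__main__":
    samples = [
        "*onomato*poeia* ?",
        "bold *word*, without a space after?",
        "tyr*annos*aurus",
        "a bold word in a (*bracket*)?",
        "How about *This *case?",
    ]
    for sample in samples:
        print(render_bold(sample))
```

On these inputs the sketch bolds the first, second and fourth lines and leaves the third and fifth untouched, which matches the results listed above.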

However, I'm not sure whether this method holds for all use cases and is well defined for all edge cases. Is there a standard way of doing this that I can copy and rely on? I'd rather not rely on my ad-hoc, not-well-thought-through language spec if I can avoid it.


Solution

  • There is no standard in the case of Markdown, and implementations differ on edge cases. For one set of choices, you could look at peg-markdown, which is also used in MultiMarkdown. Of course, Markdown is more complex than Textile in this respect, because it uses ** for bold and * for italics, giving rise to even more decisions about how to treat things like *hello**there**.

    Michel Fortin, developer of PHP Markdown Extra, has a test suite that includes a number of edge cases for bold/italics. However, I don't think there is universal agreement on his decisions here, and many implementations parse these cases differently.

    That said, I think the following decisions are fairly uncontroversial in Markdown:

    • * only starts emphasis if the next character is non-whitespace.
    • * only ends emphasis if the preceding character is non-whitespace.
    • Emphasis can occur within a word, so in he*ll*o, the two l's are emphasized (though some Markdown implementations disable this feature for the _ character, since underscores are common in identifiers).
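
    The three rules above can be sketched with regular expressions. This is a hedged illustration in Python, not how peg-markdown or any other implementation actually does it. EM_STAR, EM_UNDERSCORE and render_emphasis are hypothetical names, and ** (strong emphasis) is deliberately left untouched.

```python
import re

# Single * emphasis: the opener must be followed by non-whitespace, the closer
# must be preceded by non-whitespace, and intraword use is allowed.
# Stars that are part of a ** run are skipped on purpose.
EM_STAR = re.compile(r"(?<!\*)\*(?![\s*])([^*]+)(?<=\S)\*(?!\*)")

# Single _ emphasis with the same whitespace rules, but intraword use is
# disabled: no word character may sit directly outside the delimiters.
EM_UNDERSCORE = re.compile(r"(?<!\w)_(?=\S)([^_]+)(?<=\S)_(?!\w)")


def render_emphasis(text: str) -> str:
    """Apply the * rule first, then the _ rule, wrapping matches in <em>."""
    text = EM_STAR.sub(r"<em>\1</em>", text)
    return EM_UNDERSCORE.sub(r"<em>\1</em>", text)


if __name__ == "__main__":
    for sample in ["he*ll*o", "a * b * c", "snake_case_name", "an _italic_ word"]:
        print(render_emphasis(sample))
```

    With these patterns, he*ll*o gets intraword emphasis, a * b * c is left alone because each * is followed by whitespace, and snake_case_name is left alone because the _ rule refuses intraword matches.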