I've been looking at writing a Textile parser using Scala's parser combinator library (basically a PEG parser), and was wondering what kind of approach I should use for parsing the inline modifiers:
This is *bold* text, _italic_ text, +underlined+ text, etc.
In this case it's pretty clear what's what, and what should be parsed. However, there are a large number of edge cases where it's not so clear. Focusing only on bold text:
Which sections get bolded:
*onomato*poeia* ?
bold *word*, without a space after?
tyr*annos*aurus
a bold word in a (*bracket*)?
How about *This *case?
Obviously this is a mix of subjective (which things should count as bold) and objective (how to make the parsing rules parse it correctly).
I'm leaning towards a PEG, something like:
wordChar = [a-zA-Z]
nonWordChar = [^a-zA-Z]
boldStart = nonWordChar ~ "*" ~ wordChar
boldEnd = wordChar ~ "*" ~ nonWordChar
boldSection = boldStart ~ rep(not(boldEnd) ~ anyChar) ~ boldEnd
Which would parse the above as follows:
<b>onomato*poeia</b> ?
bold <b>word</b>, without a space after?
tyr*annos*aurus <- fails because of lack of whitespace
a bold word in a (<b>bracket</b>)?
How about *This *case? <- fails because there is no correct closing *
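As a rough check of the results above, here is a minimal sketch that approximates the rules with a plain regular expression instead of the parser combinator library. The names `BoldSketch` and `render` are made up for illustration. It assumes that the start and end of the line count as non-word boundaries, and it uses lookarounds, so the boundary characters are tested but not consumed (which means a one-letter section like `*a*` is bolded here, unlike in the grammar as written).

```scala
import scala.util.matching.Regex

object BoldSketch {
  // Opening "*": not preceded by a letter, and followed by a letter.
  // Closing "*": preceded by a letter, and not followed by a letter.
  // The lazy group in the middle stops at the first "*" that qualifies as a closer.
  private val Bold: Regex =
    """(?<![a-zA-Z])\*(?=[a-zA-Z])(.*?)(?<=[a-zA-Z])\*(?![a-zA-Z])""".r

  // Wraps every bold section of a single line in <b>...</b>.
  def render(line: String): String =
    Bold.replaceAllIn(line, m => Regex.quoteReplacement(s"<b>${m.group(1)}</b>"))
}
```

With these assumptions, `BoldSketch.render("*onomato*poeia* ?")` gives `<b>onomato*poeia</b> ?`, while `tyr*annos*aurus` and `How about *This *case?` come back unchanged.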
However, I'm not sure whether this method holds for all use cases and is well defined for all edge cases. Is there a standard way of doing this that I can copy and rely on? I'd rather not rely on my own ad-hoc, not-well-thought-through language spec if I can avoid it.
There is no standard in the case of markdown, and implementations differ on edge cases. For one set of choices in the case of markdown, you could look at peg-markdown, which is also used in MultiMarkdown. Of course, markdown is more complex than textile in this respect, because it uses ** for bold and * for italics, giving rise to even more decisions about how to treat things like *hello**there**.
Michel Fortin, developer of PHP Markdown Extra, has a test suite that includes a number of edge cases for bold/italics. However, I don't think there is universal agreement on his decisions here, and many implementations parse differently.
That said, I think the following decisions are fairly uncontroversial in markdown:
A * only starts emphasis if the next character is non-whitespace.
A * only ends emphasis if the preceding character is non-whitespace.
In he*ll*o, the two l's are emphasized (though some markdown implementations disable this feature for the _ character, since underscores are common in identifiers).
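The two whitespace rules can be sketched as a small classifier that marks each delimiter as a possible opener, a possible closer, or both. This is only an illustration, not how any particular markdown implementation does it. The names `EmphasisDelimiters`, `Delim`, and `classify` are hypothetical, and it assumes that a delimiter at the very start of the text cannot close and one at the very end cannot open.

```scala
object EmphasisDelimiters {
  // Position of a delimiter and whether it may open and/or close emphasis.
  final case class Delim(index: Int, canOpen: Boolean, canClose: Boolean)

  def classify(text: String, marker: Char = '*'): List[Delim] =
    (0 until text.length).collect {
      case i if text.charAt(i) == marker =>
        // Neighbouring characters, or None at the edges of the text.
        val prev = if (i > 0) Some(text.charAt(i - 1)) else None
        val next = if (i + 1 < text.length) Some(text.charAt(i + 1)) else None
        Delim(
          index = i,
          // Opens only if the next character exists and is non-whitespace.
          canOpen = next.exists(c => !c.isWhitespace),
          // Closes only if the preceding character exists and is non-whitespace.
          canClose = prev.exists(c => !c.isWhitespace)
        )
    }.toList
}
```

For `he*ll*o`, both delimiters come out as able to open and close, which is what allows the intraword emphasis; in `a * b`, the lone `*` can do neither. An implementation that disables intraword emphasis for `_` would need an extra check on the neighbouring characters, which this sketch leaves out.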