python parsing stack-overflow markdown markup

Reducing capabilities of markdown in python

I'm writing a comment system. It has to have a formatting system like Stack Overflow's.

Users can use some inline Markdown syntax such as bold or italic. I thought I could handle that with regex replacements.

But there is another thing I have to do: users can create code blocks by indenting with 4 spaces. I don't think I can do this with regex, or maybe parsing indents is just too advanced for me :) Also, creating lists via regex replacements looks impossible to me.

What would be the best approach for doing this?
Are there any Markdown libraries whose capabilities I can reduce? (For example, I'd try to remove table support.)
If I should write my own parser, should I write a finite state machine from scratch, or are there libraries that make it easier?

Thanks for your time and your responses.

Solution

I'd just go ahead and use python-markdown and monkey-patch it. You can write your own build_block_parser() function and substitute it for the default one to disable some of the Markdown functionality:

from markdown import blockprocessors as bp
def build_block_parser(md_instance, **kwargs):
    """ Build the default block parser used by Markdown. """
    parser = bp.BlockParser(md_instance)
    parser.blockprocessors['empty'] = bp.EmptyBlockProcessor(parser)
    parser.blockprocessors['indent'] = bp.ListIndentProcessor(parser)
    # parser.blockprocessors['code'] = bp.CodeBlockProcessor(parser)
    parser.blockprocessors['hashheader'] = bp.HashHeaderProcessor(parser)
    parser.blockprocessors['setextheader'] = bp.SetextHeaderProcessor(parser)
    parser.blockprocessors['hr'] = bp.HRProcessor(parser)
    parser.blockprocessors['olist'] = bp.OListProcessor(parser)
    parser.blockprocessors['ulist'] = bp.UListProcessor(parser)
    parser.blockprocessors['quote'] = bp.BlockQuoteProcessor(parser)
    parser.blockprocessors['paragraph'] = bp.ParagraphProcessor(parser)
    return parser
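
# Replace the module-level factory with the trimmed-down version.
# Any code that already imported build_block_parser by name keeps the old reference.
bp.build_block_parser = build_block_parser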

Note that I've simply copied and pasted the default build_block_parser() function from the blockprocessors.py file, tweaked it a bit (inserting bp. in front of all the names from that module), and commented out the line where it adds the code block processor. The resulting function is then monkey-patched back into the module. A similar method looks feasible for inlinepatterns.py, treeprocessors.py, preprocessors.py, and postprocessors.py, each of which does a different kind of processing.

Rather than rewriting the function that sets up the individual parsers, as I've done above, you could also replace the parser classes themselves with do-nothing subclasses: they would still be registered, but their test() method always returns False, so they never run. That is probably simpler:

from markdown import blockprocessors as bp
class NoProcessing(bp.BlockProcessor):
    def test(self, parent, block):
        return False   # never invoke this processor

bp.CodeBlockProcessor = NoProcessing

There might be other Markdown libraries that more explicitly allow functionality to be disabled, but python-markdown looks like it is reasonably hackable.
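
If you are on a newer python-markdown release, there may be a way to do this without monkey-patching at all. The following is only a sketch, assuming Python-Markdown 3.0 or later, where (as far as I know) the block processors live in a registry that supports deregister() by name. The helper name make_restricted_markdown is made up for illustration, and 'code' is assumed to be the name under which the indented code block processor is registered:

```python
import markdown


def make_restricted_markdown(disabled_blocks=("code",)):
    """Hypothetical helper: build a Markdown instance with some block
    processors removed by their registered names."""
    md = markdown.Markdown()
    for name in disabled_blocks:
        # Assumption: md.parser.blockprocessors is a Registry (Markdown 3.x)
        # and the indented code block processor is registered as 'code'.
        md.parser.blockprocessors.deregister(name)
    return md


md = make_restricted_markdown()
# The first line is indented by 4 spaces, which would normally become a code block.
html = md.convert("    indented text\n\n**bold** still works")
print(html)
```

With the code block processor removed, the indented line should fall through to the paragraph processor, while inline syntax such as bold is untouched. This only affects the one Markdown instance, which is less invasive than patching the module for the whole process.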