Tags: python, markdown, pyparsing, ply

Parse a string with multiple delimiters, returning a list of tuples with style and text


I'm trying to parse a string that contains some Markdown-style delimiters, and I need a list back with the text pieces and their styles. I've given it a go with pyparsing (basically following the post by mbeaches at http://pyparsing.wikispaces.com/) and have had some success, but I feel there is probably a better method.

Essentially, if I have a string

word_paragraph = "This is **bold** and this is *italic* sample"

I'd like to return a list of tuples after providing:

style_delim = {'Bold': '**', 'Italics':'*', } 
word_pg_parsed = somefunction(word_paragraph,style_delim)

which would result in word_pg_parsed as something like:

word_pg_parsed = [('Normal','This is '),('Bold','bold'),('Normal',' and this is '),('Italics','italic'),('Normal',' sample')]

I've looked into markdown, but can't find where this functionality exists. I suspect there is a library (dug into PLY but couldn't find what I am after) that handles this properly.

Why? I'm attempting to create a Word file with python-docx that includes text taken from marked-up source, so I need to handle the inline character styles accordingly. Is there something in python-markdown, or another library anyone has seen, that does this?


Solution

  • In the event someone is looking to do this, here's what I found. Many thanks to Waylan for pointing me to mistune and to lepture for the library.

    In mistune, the default_output method was replaced by placeholder. That's the method you need to override to get a list back instead of a string. Referenced here: https://github.com/lepture/mistune/pull/20

    Basically, follow what is in the test case at: https://github.com/lepture/mistune/blob/878f92bdb224a8b7830e8c33952bd2f368e5d711/tests/test_subclassing.py The __getattribute__ override is indeed required; otherwise you'll get errors about string functions being called on a list.

    Look for TokenTreeRenderer in the test_subclassing.py.

    Here is my working sample, repeated in a Django views.py:

    from django.shortcuts import render
    from .forms import ParseForm   # simple form with textarea field called markup
    import mistune
    
    
    class TokenTreeRenderer(mistune.Renderer):
        # options is required
        options = {}
    
        def placeholder(self):
            return []
    
        def __getattribute__(self, name):
            """Saves the arguments to each Markdown handling method."""
            found = TokenTreeRenderer.__dict__.get(name)
            if found is not None:
                return object.__getattribute__(self, name)
    
            def fake_method(*args, **kwargs):
                return [(name, args, kwargs)]
            return fake_method
    
    
    def parse(request):
        context = {}
        if request.method == 'POST':
            parse_form = ParseForm(request.POST)
            if parse_form.is_valid():
                # parse the data
                markdown = mistune.Markdown(renderer=TokenTreeRenderer())
                tokenized = markdown(parse_form.cleaned_data['markup'])
                context.update({'tokenized': tokenized, })
                # no need for a redirect in this case
    
        else:
            parse_form = ParseForm(initial={'markup': 'This is a **bold** text sample', })
    
        context.update({'form': parse_form, })
        return render(request, 'mctests/parse.html', context)
    

    This results in output of:

     [('paragraph', ([('text', (u'This is a ',), {}), ('double_emphasis', ([('text', (u'bold',), {})],), {}), ('text', (u' text sample',), {})],), {})]
    

    which works great for me.
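
To get from that nested token tree to the flat (style, text) list asked for in the question, one option is a small recursive walk. The following is only a sketch, assuming the tree has exactly the (name, args, kwargs) shape shown in the output above; flatten_tokens and STYLE_NAMES are hypothetical names introduced here, and the mapping covers just the emphasis and double_emphasis handlers.

```python
# Hypothetical mapping from renderer method names (as they appear in the
# token tree) to the style names used in the question; extend as needed.
STYLE_NAMES = {'double_emphasis': 'Bold', 'emphasis': 'Italics'}


def flatten_tokens(tokens, style='Normal'):
    """Flatten a (name, args, kwargs) token tree into (style, text) tuples."""
    runs = []
    for name, args, _kwargs in tokens:
        if name == 'text':
            # Leaf node: the first positional argument is the text itself.
            runs.append((style, args[0]))
            continue
        # Inline styles override the inherited style; other nodes
        # (e.g. 'paragraph') pass the current style through unchanged.
        child_style = STYLE_NAMES.get(name, style)
        for arg in args:
            # Only nested token lists are followed; plain string or
            # keyword arguments are ignored in this sketch.
            if isinstance(arg, list):
                runs.extend(flatten_tokens(arg, child_style))
    return runs


# The sample tree from the output above, written without the u'' prefixes.
tokenized = [('paragraph', ([('text', ('This is a ',), {}),
                             ('double_emphasis', ([('text', ('bold',), {})],), {}),
                             ('text', (' text sample',), {})],), {})]

print(flatten_tokens(tokenized))
# [('Normal', 'This is a '), ('Bold', 'bold'), ('Normal', ' text sample')]
```

Note that this keeps only the innermost style when styles are nested (bold inside italic would come out as 'Bold'), and it drops any handler whose content is passed as a plain string rather than as a nested list. Both would need extra handling before feeding the result to python-docx.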