python, markdown, recursive-datastructures

Efficient way to convert DOM-like structure to markdown


I have a DOM-like tree that I'm trying to convert to markdown. For example, it can look like this:

[
   {
      'type': 'header',
      'attr': {
         'size': 2
      },
      'children': [
         'A header, ',
         {
            'type': 'link',
            'attr': {
               'url': 'https://www.google.com'
            },
            'children': 'a link inside a header'
         }
      ]
   },
   'some more text'
]

and the output I want is

## A header, [a link inside a header](https://www.google.com)
some more text

I've tried

def genMd(tree):
   md_string = ''

   for element in tree:
      if type(element) == str:
         md_string += element

      elif type(element) == dict:
         if element['type'] == 'header':
            md_string += '{} {}\n'.format('#' * element['attr']['size'], genMd(element['children']))
 
         elif element['type'] == 'link':
            md_string += '[{}]({})'.format(genMd(element['children']), element['attr']['url'])

         # I would add more if statements here for the other cases

   return md_string

which works, but it seems very inefficient and I would end up having tons of if statements. I've also tried this

def genMd(tree):
   MD_TABLE = {
      'header': '\'{} {}\\n\'.format(\'#\' * element[\'attr\'][\'size\'], genMd(element[\'children\']))',
      'link': '\'[{}]({})\'.format(genMd(element[\'children\']), element[\'attr\'][\'url\'])'
      # More entries here for the other cases
   }

   md_string = ''

   for element in tree:
      if type(element) == str:
         md_string += element
      
      elif type(element) == dict:
         md_string += eval(MD_TABLE[element['type']])

   return md_string

and it also works but it still feels wrong.

TL;DR: using if statements just feels wrong. Is there a better way to do it?


Solution

  • Another approach could consist of using a generator function to traverse the DOM tree while keeping separate functions to handle the specific formatting of various types:

    def markdown(d):
       # One small formatter per node type; children are rendered recursively.
       def m_header(a):
          yield f"{'#'*a['attr']['size']} " + ''.join(markdown(a['children']))
       def m_link(a):
          yield f'[{"".join(markdown(a["children"]))}]({a["attr"]["url"]})'
       types = {'header': m_header, 'link': m_link}
       # Accept a single node (string or dict) as well as a list of nodes.
       for i in ([d] if not isinstance(d, list) else d):
          if not isinstance(i, dict):
             yield i
          else:
             yield from types[i['type']](i)
    
    dom = [{'type': 'header', 'attr': {'size': 2}, 'children': ['A header, ', {'type': 'link', 'attr': {'url': 'https://www.google.com'}, 'children': 'a link inside a header'}]}, 'some more text']
    print('\n'.join(markdown(dom)))
    

    Output:

    ## A header, [a link inside a header](https://www.google.com)
    some more text
    

    A couple of observations:

    1. By using a generator with str.join, you avoid repeatedly concatenating strings via +=, which can copy the growing string on each step. The code is cleaner and usually faster on large trees.
    2. Using functions to produce the markup for specific types is more maintainable than storing code as strings, and more secure than running those strings through eval.
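
To illustrate the second observation, here is a minimal sketch of the same idea with a decorator-based registry, so supporting a new node type means adding one function and leaving the traversal alone. The names `RENDERERS`, `renders`, and `render`, and the `bold` node type, are assumptions made for illustration; none of them appear in the question. Unlike the generator version, this sketch returns plain strings.

```python
# Hypothetical registry: maps a node's 'type' to the function that renders it.
RENDERERS = {}


def renders(node_type):
    """Register the decorated function as the renderer for node_type."""
    def decorator(func):
        RENDERERS[node_type] = func
        return func
    return decorator


def render(node):
    """Render a string, a dict node, or a list of either to markdown."""
    if isinstance(node, str):
        return node
    if isinstance(node, list):
        # Children are concatenated as-is; each renderer adds its own spacing.
        return ''.join(render(child) for child in node)
    return RENDERERS[node['type']](node)


@renders('header')
def render_header(node):
    return '{} {}\n'.format('#' * node['attr']['size'], render(node['children']))


@renders('link')
def render_link(node):
    return '[{}]({})'.format(render(node['children']), node['attr']['url'])


@renders('bold')  # assumed extra node type, not part of the question
def render_bold(node):
    return '**{}**'.format(render(node['children']))


dom = [
    {'type': 'header', 'attr': {'size': 2}, 'children': [
        'A header, ',
        {'type': 'link', 'attr': {'url': 'https://www.google.com'},
         'children': 'a link inside a header'},
    ]},
    'some more text',
]
print(render(dom))
```

An unknown `type` raises a `KeyError` at the lookup in `render`, which may be preferable to silently skipping a node. If a fallback is wanted, `RENDERERS.get` with a default renderer would be one option.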