
How can I get a list of image URLs from a Markdown file in Python?


I'm looking for something like this:

data = '''
**this is some markdown**
blah blah blah
![image here](http://somewebsite.com/image1.jpg)
![another image here](http://anotherwebsite.com/image2.jpg)
'''

print(get_images_url_from_markdown(data))

that returns a list of image URLs from the text:

['http://somewebsite.com/image1.jpg', 'http://anotherwebsite.com/image2.jpg']

Is there anything available, or do I have to scrape Markdown myself with BeautifulSoup?


Solution

  • Python-Markdown has an extensive Extension API. In fact, the Table of Contents Extension does essentially what you want with headings (instead of images) plus a bunch of other stuff you don't need (like adding unique id attributes and building a nested list for the TOC).

    After the document is parsed, it is contained in an ElementTree object, and you can use a treeprocessor to extract the data you want before the tree is serialized to text. Just be aware that images included as raw HTML are not part of that tree, so this approach will not find them (in that case you would need to parse the HTML output and extract the URLs from it).

    Start off by following this tutorial, except that you will need to create a treeprocessor rather than an inline Pattern. You should end up with something like this:

    import markdown
    from markdown.treeprocessors import Treeprocessor
    from markdown.extensions import Extension
    
    # First create the treeprocessor
    
    class ImgExtractor(Treeprocessor):
        def run(self, doc):
            """Find all images and store their URLs in self.md.images."""
            self.md.images = []
            for image in doc.findall('.//img'):
                self.md.images.append(image.get('src'))
    
    # Then tell markdown about it
    
    class ImgExtExtension(Extension):
        def extendMarkdown(self, md):
            img_ext = ImgExtractor(md)
            md.treeprocessors.register(img_ext, 'img_ext', 15)
    
    # Finally create an instance of the Markdown class with the new extension
    
    md = markdown.Markdown(extensions=[ImgExtExtension()])
    
    # Now let's test it out:
    
    data = '''
    **this is some markdown**
    blah blah blah
    ![image here](http://somewebsite.com/image1.jpg)
    ![another image here](http://anotherwebsite.com/image2.jpg)
    '''
    html = md.convert(data)
    print(md.images)
    

    The above outputs:

    ['http://somewebsite.com/image1.jpg', 'http://anotherwebsite.com/image2.jpg']
    

    If you really want a function which returns the list, just wrap that all up in one and you're good to go.