
Extracting URL and anchor text from Markdown using Python


I am trying to extract anchor text and the associated URLs from Markdown. I've seen a related question, but its answer doesn't fully cover what I need.

In Markdown, there are two ways to insert a link:

Example 1:

[anchor text](http://my.url)

Example 2:

[anchor text][1]

   [1]: http://my.url

My script looks like this (note that I am using the third-party regex module, not re):

import regex

# One inline link and one reference-style link whose label [1] is defined at the end.
body_markdown = "This is an [inline link](http://google.com). This is a [non inline link][1]\r\n\r\n  [1]: http://yahoo.com"

# Raw string so the backslashes reach the regex engine unchanged.
rex = r"""(?|(?<txt>(?<url>(?:ht|f)tps?://\S+(?<=\P{P})))|\(([^)]+)\)\[(\g<url>)\])"""
pattern = regex.compile(rex)
matches = regex.findall(pattern, body_markdown, overlapped=True)
for m in matches:
    print(m)

This produces the output:

('http://google.com', 'http://google.com')
('http://yahoo.com', 'http://yahoo.com')

My expected output is:

('inline link', 'http://google.com')
('non inline link', 'http://yahoo.com')

How can I properly capture the anchor text from Markdown?


Solution

  • How can I properly capture the anchor text from Markdown?

    Render it to a structured format (e.g., HTML), then use a proper parser to extract the link labels and addresses.

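    import markdown
    from lxml import etree

    body_markdown = "This is an [inline link](http://google.com). This is a [non inline link][1]\r\n\r\n  [1]: http://yahoo.com"

    # Render to HTML and parse the result as XML (here the output is a single <p> element).
    doc = etree.fromstring(markdown.markdown(body_markdown))
    for link in doc.xpath('//a'):
        # link.text is the anchor text, href is the target URL
        print(link.text, link.get('href'))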
    

    Which gets me:

    inline link http://google.com
    non inline link http://yahoo.com
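
One caveat with the snippet above: `etree.fromstring` expects a single root element, so it only works while the rendered HTML is one paragraph, and `link.text` stops at the first nested tag such as `<em>`. Below is a minimal sketch of a more forgiving extractor that uses only the standard library's `html.parser`. The names `LinkCollector` and `extract_links` are made up for illustration, and the `html` string is a hand-written stand-in for what `markdown.markdown()` might return, not actual library output.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (anchor text, href) pairs from an HTML string."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (text, href) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Keep text from nested tags too (e.g. <em> inside the link).
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text), self._href))
            self._href = None


def extract_links(html):
    """Return a list of (anchor text, href) tuples found in html."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical rendered HTML: two top-level paragraphs, one link with nested markup.
html = (
    '<p>This is an <a href="http://google.com">inline link</a>.</p>\n'
    '<p>This is a <a href="http://yahoo.com">non <em>inline</em> link</a></p>'
)
print(extract_links(html))
```

In real use you would pass `markdown.markdown(body_markdown)` to `extract_links` instead of the stand-in string. Anchors without an `href` attribute are skipped.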
    

    The alternative is writing your own Markdown parser, which seems like the wrong place to focus your effort.
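
For a sense of what the hand-rolled route involves, here is a rough two-pass sketch using the standard `re` module: first collect the reference definitions, then resolve inline and reference-style links against them. It is an illustration under heavy assumptions, not a Markdown parser: it ignores nested brackets and code spans, treats images as links, and drops link titles. The names `REF_DEF`, `LINK`, and `extract_markdown_links` are hypothetical.

```python
import re

# Reference definitions such as "  [1]: http://yahoo.com" (up to 3 leading spaces).
REF_DEF = re.compile(r"^ {0,3}\[([^\]]+)\]:\s*(\S+)", re.MULTILINE)

# Inline links "[text](url)" or reference links "[text][label]".
LINK = re.compile(r"\[([^\]]+)\](?:\(([^)\s]+)[^)]*\)|\[([^\]]*)\])")


def extract_markdown_links(text):
    """Return (anchor text, url) tuples for the links this sketch recognises."""
    # Labels are matched case-insensitively, so normalise them to lower case.
    refs = {label.lower(): url for label, url in REF_DEF.findall(text)}
    links = []
    for match in LINK.finditer(text):
        anchor, inline_url, label = match.groups()
        if inline_url is not None:
            links.append((anchor, inline_url))
        else:
            # "[text][]" is shorthand for "[text][text]".
            url = refs.get((label or anchor).lower())
            if url is not None:  # skip labels with no definition
                links.append((anchor, url))
    return links


body_markdown = (
    "This is an [inline link](http://google.com). "
    "This is a [non inline link][1]\r\n\r\n  [1]: http://yahoo.com"
)
for pair in extract_markdown_links(body_markdown):
    print(pair)
```

On the sample string this yields the two pairs the question asks for, but every limitation listed above is a case a real Markdown library already handles, which is the argument for rendering to HTML first.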