
Extract Links from HTML In Line with Text with Python/BeautifulSoup


There are many answers on how to convert HTML to text using BeautifulSoup (for example https://stackoverflow.com/a/24618186/3946214).

There are also many answers on how to extract links from HTML using BeautifulSoup.

What I need is a way to turn HTML into a text-only version while keeping each link inline, next to the text it belongs to. For example, if I had some HTML that looked like this:

<div>Click <a href="www.google.com">Here</a> to receive a quote</div>

It would be nice to convert this to "Click Here (www.google.com) to receive a quote."

The use case here is that I need to convert HTML emails into a text-only version, and it would be nice to have the links where they are semantically located in the HTML, instead of at the bottom. This exact syntax isn't required. I'd appreciate any guidance on how to do this. Thank you!


Solution

  • If you want a BeautifulSoup solution, you can start with this example (it will probably need more tuning for real-world data):

    from bs4 import BeautifulSoup

    data = '<div>Click <a href="www.google.com">Here</a> to receive a quote.</div>'

    soup = BeautifulSoup(data, 'html.parser')

    # append the URL, in parentheses, to the text of every link
    for a in soup.select('a[href]'):
        a.append(' ({})'.format(a['href']))

    # unwrap() every tag so that only the text nodes remain
    for tag in soup.select('*'):
        tag.unwrap()

    print(soup)
    

    Prints:

    Click Here (www.google.com) to receive a quote.
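
    For email bodies, the same idea can be wrapped in a small function. This is only a sketch under a few assumptions: the function name html_to_text_with_links is made up for illustration, the URL is skipped when the link text is already the URL itself, and get_text() is used instead of printing the soup so that entities such as &amp; come back as plain characters rather than staying escaped.

```python
from bs4 import BeautifulSoup


def html_to_text_with_links(html):
    """Return the text of `html` with each link's URL placed after its text.

    Hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, 'html.parser')

    for a in soup.find_all('a', href=True):
        href = a['href']
        text = a.get_text(strip=True)
        # Assumption: do not repeat the URL when the link text already is the URL.
        if href and href != text:
            a.append(' ({})'.format(href))

    # get_text() joins all text nodes and returns unescaped plain text.
    return soup.get_text()


print(html_to_text_with_links(
    '<div>Click <a href="www.google.com">Here</a> to receive a quote.</div>'
))
```

    Real emails will likely need more work on top of this, for example whitespace cleanup and line breaks between block-level elements.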