python web-scraping beautifulsoup findall

Extract text with embedded link using BeautifulSoup


I'm trying to extract the text of a web article that includes links as part of the text. An example of this would be:

<p>Here is some text with <a href="https://www.example.com"> this part as a link</a>
which we will look at.</p>

I've tried using

table.findAll('p', text = True)

on the data, but this call skips every 'p' tag that contains a URL (that is, it wouldn't pick up the example in the first block). How can I extract the text from 'p' tags, including the embedded links, while dropping each link's URL and keeping only its highlighted text ('this part as a link')? Any help is greatly appreciated.


Solution

  • Essentially like this:

    >>> import bs4
    >>> HTML = '''\
    ... <p>Here is some text with <a href="https://www.example.com"> this part as a link</a>
    ... which we will look at.</p>'''
    >>> soup = bs4.BeautifulSoup(HTML, 'lxml')
    >>> [p.text for p in soup.findAll('p')]
    ['Here is some text with  this part as a link\nwhich we will look at.']
    

    Of course, you would most likely want to collapse the newlines and redundant spaces afterwards.
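
The whitespace cleanup mentioned above could look roughly like the sketch below. It is one possible approach, not the only one: the helper name `paragraph_texts` is made up for illustration, and the built-in 'html.parser' is assumed here only so that lxml does not need to be installed. As far as I can tell, the original `text = True` filter skipped the paragraph because it only matches tags whose `.string` is not None, and a 'p' tag holding both text and an 'a' child has no single string.

```python
from bs4 import BeautifulSoup

HTML = '''\
<p>Here is some text with <a href="https://www.example.com"> this part as a link</a>
which we will look at.</p>'''


def paragraph_texts(html):
    """Return the text of every <p>, link text included, URLs dropped.

    Hypothetical helper for illustration. get_text() concatenates all
    nested strings, so the <a> contents are kept while the href
    attribute never appears in the result.
    """
    soup = BeautifulSoup(html, 'html.parser')
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, newlines, tabs), so joining with a single space
    # collapses the newline and the doubled blank.
    return [' '.join(p.get_text().split()) for p in soup.find_all('p')]


texts = paragraph_texts(HTML)
print(texts)
# ['Here is some text with this part as a link which we will look at.']
```

`find_all` is the current spelling of `findAll`; both should behave the same here.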