I'm trying to extract the text of a web article that includes links as part of the text. An example of this would be:
<p>Here is some text with <a href="https://www.example.com"> this part as a link</a>
which we will look at.</p>
I've tried using
table.findAll('p', text = True)
on the data, but this command ignores all 'p' tags that contain URLs (that is, it wouldn't pick up the example in the first block). My question is: how can I extract the text from 'p' tags, including the text of any embedded links, while dropping the URL of each link and keeping only the visible link text ('this part as a link')? Any help is greatly appreciated.
Drop the text=True filter and read each tag's .text instead, essentially like this:
>>> import bs4
>>> HTML = '''\
... <p>Here is some text with <a href="https://www.example.com"> this part as a link</a>
... which we will look at.</p>'''
>>> soup = bs4.BeautifulSoup(HTML, 'lxml')
>>> [p.text for p in soup.findAll('p')]
['Here is some text with  this part as a link\nwhich we will look at.']
Of course, you will most likely want to replace the newlines and collapse the redundant spaces.
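One way to do that cleanup, as a minimal sketch: split the extracted text on whitespace and join the pieces with single spaces. The helper name normalize_whitespace is hypothetical (it is not part of BeautifulSoup), and the raw string below is just the paragraph text from the example, written out by hand so the snippet runs on its own.

```python
def normalize_whitespace(text):
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and discards leading/trailing whitespace,
    # so joining with a single space collapses every run to one blank.
    return " ".join(text.split())


# Paragraph text from the example: note the double space before "this"
# and the newline before "which".
raw = "Here is some text with  this part as a link\nwhich we will look at."
cleaned = normalize_whitespace(raw)
print(cleaned)
```

In the list comprehension above you could apply it as normalize_whitespace(p.text) for each p. Note that this flattens all whitespace, so it is not suitable if line breaks inside a paragraph are meaningful to you.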