python html beautifulsoup urllib3

Given an html paragraph and a link, is there a way to retrieve the text before and the text after the link inside the paragraph in Python?


I am using urllib3 to get the html of some pages.

I want to retrieve the text from the paragraph where the link is, with the text before and after the link stored separately.

For example:

import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()
r = http.request('get', "https://www.snopes.com/fact-check/michael-novenche/")
body = r.data
soup = BeautifulSoup(body, 'lxml')
for a in soup.find_all('a'):
    if a.has_attr('href'):
        if (a['href'] == "http://web.archive.org/web/20040330161553/http://newyork.local.ie/content/31666.shtml/albany/news/newsletters/general"):
            link_text = a
            link_para = a.find_parent("p")
            print(link_text)
            print(link_para)

Paragraph

<p>The message quoted above about Michael Novenche, a two-year-old boy 
undergoing chemotherapy to treat a brain tumor, was real, but keeping up with 
all the changes in his condition proved a challenge.  The message quoted above 
stated that Michael had a large tumor in his brain, was operated upon to 
remove part of the tumor, and needed prayers to help him through chemotherapy 
to a full recovery.  An <nobr>October 2000</nobr> article in <a 
href="http://web.archive.org/web/20040330161553/http://newyork.local.ie/conten
t/31666.shtml/albany/news/newsletters/general" 
onmouseout="window.status='';return true" onmouseover="window.status='The
Local Albany Weekly';return true" target="_blank"><i>The Local Albany 
Weekly</i></a> didn’t mention anything about little Michael’s medical 
condition but said that his family was “in need of funds to help pay for the
 transportation to the hospital and other costs not covered by their 
insurance.”  A June 2000 message posted to the <a 
href="http://www.ecunet.org/whatisecupage.html" 
onmouseout="window.status='';return true" 
onmouseover="window.status='Ecunet';return true" target="_blank">Ecunet</a> 
mailing list indicated that Michael had just turned <nobr>3 years</nobr> old, 
mentioned that his tumor appeared to be shrinking, and provided a mailing 
address for him:</p>

Link

<a href="http://web.archive.org/web/20040330161553/http://newyork.local.ie/conten
t/31666.shtml/albany/news/newsletters/general"
onmouseout="window.status='';return true" onmouseover="window.status='The 
Local Albany Weekly';return true" target="_blank"><i>The Local Albany 
Weekly</i></a>

Text to be retrieved (2 parts)

The message quoted above about Michael Novenche, a two-year-old boy 
undergoing chemotherapy ... was operated upon to 
remove part of the tumor, and needed prayers to help him through chemotherapy 
to a full recovery.  An October 2000 article in
didn’t mention anything about little Michael’s medical 
condition but said that his family was ... turned 3 years old, 
mentioned that his tumor appeared to be shrinking, and provided a mailing 
address for him:

I can't simply call get_text() and then split on the link text, because the link text might be repeated elsewhere in the paragraph.

I thought I might count how many times the link text is repeated, call split(), and then loop to pick out the parts I want.

I would appreciate a cleaner method, though.


Solution

  • I found a solution based on @Andrej Kesely's answer.

    It deals with two problems:

    1. There may be no text before or after the link

    2. The link may not be a direct child of the paragraph

    Here it is (as a function):

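    import urllib3
    from bs4 import BeautifulSoup

    # get_info() below uses this pool; the 'lxml' parser must be installed but needs no import
    http = urllib3.PoolManager()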
    
    def get_info(page,link):
        r = http.request('get', page)
        body = r.data
        soup = BeautifulSoup(body, 'lxml')
        a = soup.find('a', href=link)
        s, parts = '', []
    
        if a.parent.name=="p":
            for t in a.parent.contents:
                if t == a:
                    parts += [s]
                    s = ''
                    continue
                s += str(t)
            parts += [s]
        else:
            prnt = a.find_parents("p")[0]
            for t in prnt.contents:
                if t == a or (str(a) in str(t)):
                    parts+=[s]
                    s=''
                    continue
                s+=str(t)
            parts+=[s]
    
        try:
            text_before_link = BeautifulSoup(parts[0], 'lxml').body.text.strip()
        except AttributeError as error:
            text_before_link = ""
    
        try:
            text_after_link = BeautifulSoup(parts[1], 'lxml').body.text.strip()
        except AttributeError as error:
            text_after_link = ""
    
        return text_before_link, text_after_link
    

    This assumes that there is no paragraph inside another paragraph.

    If anyone has any ideas about scenarios where this fails, please feel free to mention it.
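
One scenario where the function above may lose text: when the link is nested inside another inline tag (for example `<b>`), the whole enclosing child is skipped, so any text next to the link inside that tag is dropped. Here is a minimal sketch of an alternative that walks the paragraph's text nodes in document order instead of its direct children. The name `split_around_link` is hypothetical, the function takes an HTML string rather than fetching a page, and it uses the built-in `html.parser` so that it runs without lxml. These are assumptions for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def split_around_link(html, href, parser="html.parser"):
    """Return (text_before, text_after) for the first <a href=...> inside a <p>.

    Hypothetical helper: returns None if the link or an enclosing <p> is missing.
    """
    soup = BeautifulSoup(html, parser)
    a = soup.find("a", href=href)
    if a is None:
        return None
    p = a.find_parent("p")
    if p is None:
        return None

    before, after = [], []
    seen_link = False
    # .descendants yields tags and strings in document order
    for node in p.descendants:
        if node is a:
            seen_link = True
            continue
        if not isinstance(node, NavigableString):
            continue  # only text nodes carry the text we want
        if any(parent is a for parent in node.parents):
            continue  # skip the link's own text
        (after if seen_link else before).append(str(node))

    return "".join(before).strip(), "".join(after).strip()


sample = '<p>Before <b>bold <a href="x"><i>Link</i></a> tail</b> after. Link again.</p>'
print(split_around_link(sample, "x"))
```

Because the link is located by identity (`node is a`) rather than by its text, repeated link text elsewhere in the paragraph does not affect the split. Like the function above, this assumes paragraphs are not nested.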