python, beautifulsoup, web-scripting

Preserve multi-line addresses separated with `<br/>`


  • How can I remove the extra blank line between address lines? I am using BeautifulSoup to scrape a web page.
  • I know that <br/> generates a new line. However, if I use replace() to turn it into a space, or strip(), the address lines collapse into one line. How can I keep the separate address lines shown in the expected output below?

Input from the HTML:

<span class="c2">1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),<br />Karachi - 75640<br />Pakistan</span><br />

My code is as follows:

if item.find('span', class_='c2') is not None:
    address = item.find_all('span', class_='c2')
    for a in item.find_all('span', {"class": "c2"}):
        for addr in address:
            print('Before', addr)
            if addr.find_all("br"):
                # Iterating over a tag yields its children: text nodes and <br/> tags
                for a in addr:
                    print('a', a)
                    if '<br/>' in a:
                        print('a loop', a)

My output for the span with class c2 is as follows:

<span class="c2">1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),<br />Karachi - 75640<br />Pakistan</span><br />

Test output from the loop over the span's children:

Before <span class="c2">1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),<br/>Karachi - 75640<br/>Pakistan</span>
a 1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),
a <br/>
a Karachi - 75640
a <br/>
a Pakistan      

This produces my current, undesired output:
1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),

Karachi - 75640

Pakistan

Expected output:
1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),
Karachi - 75640
Pakistan


Solution

  • You can use the replace_with() method of a tag object to swap each <br/> tag for a newline character:

    from bs4 import BeautifulSoup
    
    data = '''<span class="c2">1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),<br />Karachi - 75640<br />Pakistan</span><br />'''
    
    soup = BeautifulSoup(data, 'lxml')
    
    # Replace every <br/> tag with a plain newline string
    for br in soup.select('br'):
        br.replace_with('\n')
    
    print(soup.text.strip())
    

    Prints:

    1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),
    Karachi - 75640
    Pakistan
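
  • An alternative that leaves the parsed tree unmodified is get_text() with a separator. The sketch below is not part of the original answer. It assumes the address always sits in a span with class c2, and it uses the built-in html.parser instead of lxml so that no extra parser has to be installed. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = '''<span class="c2">1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),<br />Karachi - 75640<br />Pakistan</span><br />'''

# html.parser ships with Python (assumption: lxml is not required here)
soup = BeautifulSoup(html, 'html.parser')
span = soup.find('span', class_='c2')

# Join the span's text nodes with a newline; strip=True trims each piece
# and drops pieces that are empty after trimming. The <br/> tags contain
# no text, so they only mark where one piece ends and the next begins.
address = span.get_text(separator='\n', strip=True)
print(address)
```

    Because each text node is stripped before joining, stray whitespace around the <br/> tags does not turn into blank lines between the address lines.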