Tags: html, python-3.x, beautifulsoup, html-parsing

Using BeautifulSoup to extract the text between the start of a paragraph tag and a line break


I have the following HTML document:

<p>
  "Year: 1932"
   <br>
   <br>
  "Total Share : 0.5 Lakhs (Pure Estimate)"
  <br>
  <br>
  "Verdict"
</p>

I am already using BeautifulSoup to extract the other elements of the page, but I can't find a way to get these lines separately, as they appear above. Everything inside the <p> comes back as a single line.


Solution

  • Try something like this:

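    from bs4 import BeautifulSoup
    
    # Sample markup from the question; replace with your own HTML string
    response_data = """<p>
      "Year: 1932"
       <br>
       <br>
      "Total Share : 0.5 Lakhs (Pure Estimate)"
      <br>
      <br>
      "Verdict"
    </p>"""
    
    # html5lib is a separate package and must be installed alongside bs4
    soup_data = BeautifulSoup(response_data, features="html5lib")
    # Split on newlines directly; converting them to "," first would also
    # split any line that itself contains a comma
    string_data = soup_data.find('p').text.strip().replace("\"", "").split("\n")
    # Keep only the non-empty lines, trimmed of surrounding whitespace
    data_list = [strng.strip() for strng in string_data if strng.strip()]
    
    print(data_list)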
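
The answer above depends on the newlines that happen to be in the HTML source. If the same paragraph arrives on a single line, there is nothing to split on. A minimal sketch of an alternative, assuming each piece of text is its own text node separated by <br> tags: iterate over the paragraph's stripped_strings. The helper name paragraph_lines and the one-line sample markup are made up for illustration, and the built-in html.parser is swapped in for html5lib to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the question's paragraph collapsed onto one line,
# to show that this approach does not rely on newlines in the source.
html = (
    '<p>"Year: 1932"<br><br>'
    '"Total Share : 0.5 Lakhs (Pure Estimate)"<br><br>'
    '"Verdict"</p>'
)


def paragraph_lines(markup):
    """Return the text pieces of the first <p>, one per <br>-separated segment."""
    soup = BeautifulSoup(markup, "html.parser")
    paragraph = soup.find("p")
    if paragraph is None:
        return []
    # stripped_strings yields each text node with surrounding whitespace
    # removed and skips nodes that contain only whitespace.
    # strip('"') drops the literal quotes around each piece, if present.
    return [text.strip('"') for text in paragraph.stripped_strings]


print(paragraph_lines(html))
```

Because each text node is handled on its own, commas or other punctuation inside a line are left untouched.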