Tags: python, python-3.x, beautifulsoup, html-parsing

Beautiful Soup: How to extract data from HTML tags with inconsistent markup


I want to extract the text from tags that come in two forms:

<td><div><font> Something else</font></div></td>

and

<td><div><font> Something <br/>else</font></div></td>

I am using the .string property. In the first case it gives me the required string (Something else), but in the second case it gives me None.

Is there a better or alternative way to do it?


Solution

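  • Use the .text property instead of .string. .string returns None when a tag has more than one child, which is what happens once <br/> splits the text into two pieces. .text concatenates all the text inside the tag, however many children it has.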

    from bs4 import BeautifulSoup
    
    html1 = '<td><div><font> Something else</font></div></td>'
    html2 = '<td><div><font> Something <br/>else</font></div></td>'
    
    if __name__ == '__main__':
        soup1 = BeautifulSoup(html1, 'html.parser')
        div1 = soup1.select_one('div')
        print(div1.text.strip())
    
        soup2 = BeautifulSoup(html2, 'html.parser')
        div2 = soup2.select_one('div')
        print(div2.text.strip())
    

    which outputs:

    Something else
    Something else
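
The difference between the two cases can be checked directly. The following is a minimal sketch, assuming the same two snippets from the question. The helper name cell_text is hypothetical (it is not part of Beautiful Soup). It uses get_text() with a separator and strip=True, which is one way to avoid words running together when the <br/> has no surrounding spaces.

```python
from bs4 import BeautifulSoup

html1 = '<td><div><font> Something else</font></div></td>'
html2 = '<td><div><font> Something <br/>else</font></div></td>'


def cell_text(html):
    """Hypothetical helper: text of the first <div>, fragments joined by one space."""
    div = BeautifulSoup(html, 'html.parser').select_one('div')
    # strip=True trims each text fragment; separator=' ' joins the fragments
    return div.get_text(separator=' ', strip=True)


# <font> has a single text child in html1, but three children in html2
# (text, <br/>, text), so .string is None in the second case.
string1 = BeautifulSoup(html1, 'html.parser').font.string
string2 = BeautifulSoup(html2, 'html.parser').font.string

text1 = cell_text(html1)
text2 = cell_text(html2)

print(repr(string1), repr(string2))
print(text1)
print(text2)
```

If the text fragments are needed separately rather than joined, iterating over the tag's stripped_strings generator gives the same trimmed pieces one at a time.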