Ordered list of the text and element data of an HTML element with BeautifulSoup


I would like to parse the content of the following div element with BeautifulSoup (bs4):

<div><!--block-->&nbsp; &nbsp; Some text is here&nbsp;<br>&nbsp; &nbsp; &nbsp; &nbsp; - Another text&nbsp;<br>&nbsp; &nbsp; &nbsp; &nbsp; - More text&nbsp;<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;</div>

I need an ordered list of the content. For this case, the list should contain the following items:

- non-breaking space
- non-breaking space
- text data
- br
- non-breaking space
...
- non-breaking space

Using tag.find_all() I can get a list of tags such as "br", but it does not return any of the other content, such as the non-breaking spaces or the text data.


Solution

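  • tag.contents is what I was looking for. It returns the tag's direct children in document order, including text nodes and comments as well as tags.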
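
A minimal sketch of how this might look for the div from the question, assuming the built-in "html.parser" backend. The describe() helper and the (kind, value) labels are hypothetical names introduced only for illustration. Note that BeautifulSoup keeps a run of adjacent text as a single text node, so the non-breaking spaces show up as "\xa0" characters inside the text items rather than as separate list entries.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

html = (
    "<div><!--block-->&nbsp; &nbsp; Some text is here&nbsp;<br>"
    "&nbsp; &nbsp; &nbsp; &nbsp; - Another text&nbsp;<br>"
    "&nbsp; &nbsp; &nbsp; &nbsp; - More text&nbsp;<br>"
    "&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;</div>"
)

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")


def describe(node):
    """Hypothetical helper: label one child node as a (kind, value) pair."""
    # Comment is a subclass of NavigableString, so it has to be tested first.
    if isinstance(node, Comment):
        return ("comment", str(node))
    if isinstance(node, NavigableString):
        return ("text", str(node))
    if isinstance(node, Tag):
        return ("tag", node.name)
    return ("other", str(node))


# Direct children of the div, in document order: comment, text and tags.
items = [describe(child) for child in div.contents]

# For comparison: find_all(True) matches tags only, so text is left out.
tags_only = div.find_all(True)

for kind, value in items:
    print(kind, repr(value))
```

If each non-breaking space is needed as its own item, the text values could be split further on "\xa0" after this step; tag.contents itself does not do that.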