Suppose I have the following HTML:
html_doc = """
<html>
<head>
<title>Page Title</title>
</head>
<body>
<div class = "Box1">
<span class = "catagory">Plant</span>
<div class = "Box2">
<span class = "sub-catagory">Trees</span>
<div class = "characters">
<div class = "font-medium">1.2</div>
<div class = "font-medium">1.6</div>
<div class = "font-medium">1.7</div>
<div class = "font-medium">1.8</div>
<div class = "font-medium">1.9</div>
<div class = "font-medium">1.4</div>
</div>
<span class = "sub-catagory">Flowers</span>
<div class = "characters">
<div class = "font-medium">2.2</div>
<div class = "font-medium">3.6</div>
<div class = "font-medium">4.7</div>
<div class = "font-medium">5.8</div>
<div class = "font-medium">6.9</div>
<div class = "font-medium">7.4</div>
</div>
</div>
<span class = "catagory">animals</span>
<div class = "Box2">
<span class = "sub-catagory">human</span>
<div class = "characters">
<div class = "font-medium">7.2</div>
<div class = "font-medium">9.6</div>
<div class = "font-medium">4.7</div>
<div class = "font-medium">3.8</div>
<div class = "font-medium">6.9</div>
<div class = "font-medium">9.4</div>
</div>
<span class = "sub-catagory">dog</span>
<div class = "characters">
<div class = "font-medium">4.2</div>
<div class = "font-medium">5.6</div>
<div class = "font-medium">6.7</div>
<div class = "font-medium">1.8</div>
<div class = "font-medium">3.9</div>
<div class = "font-medium">8.4</div>
</div>
</div>
<span class = "catagory">non-living</span>
<div class = "Box2">
<span class = "sub-catagory">rock</span>
<div class = "characters">
<div class = "font-medium">1.2</div>
<div class = "font-medium">1.6</div>
<div class = "font-medium">4.7</div>
<div class = "font-medium">6.8</div>
<div class = "font-medium">1.9</div>
<div class = "font-medium">0.4</div>
</div>
<span class = "sub-catagory">stars</span>
<div class = "characters">
<div class = "font-medium">3.2</div>
<div class = "font-medium">5.6</div>
<div class = "font-medium">2.7</div>
<div class = "font-medium">4.8</div>
<div class = "font-medium">1.9</div>
<div class = "font-medium">2.4</div>
</div>
</div>
</div>
</div>
</body>
</html>
"""
Using the BeautifulSoup package for Python, I am able to get the categories, subcategories, and characters separately, as shown below:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

catagories = soup.find_all('span', class_='catagory')
for catagory in catagories:
    print(catagory.get_text())  # gives Plant, animals, non-living

sub_catagories = soup.find_all('span', class_='sub-catagory')
for sub_catagory in sub_catagories:
    print(sub_catagory.get_text())  # gives the subcategories

measurements = soup.find_all('div', class_='font-medium')
for measurement in measurements:
    print(measurement.get_text())  # gives all the font-medium values together
I am not sure how to get the following result, since the div classes are all the same. Please help.
Plant Trees 1.2 1.6 1.7 1.8 1.9 1.4 Flowers 2.2 3.6 4.7 5.8 6.9 7.4 animals human 7.2 9.6 4.7 3.8 6.9 9.4 dog 4.2 5.6 6.7 1.8 3.9 8.4 non-living rock 1.2 1.6 4.7 6.8 1.9 0.4 stars 3.2 5.6 2.7 4.8 1.9 2.4
To print the texts in the expected order, select your Box1 element and extract its text with get_text(), setting its separator parameter to '\n' and strip=True:
print(soup.select_one('.Box1').get_text('\n',strip=True))
Plant
Trees
1.2
1.6
1.7
1.8
1.9
1.4
Flowers
2.2
3.6
4.7
5.8
6.9
7.4
animals
...
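If the goal is the single-line form shown in the question rather than one value per line, the same call should work with a space as the separator. This is a minimal sketch; the shortened html_doc below is a stand-in for the one in the question (an assumption made to keep the example small), and the name one_line is only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the html_doc in the question (assumption for brevity).
html_doc = """
<div class="Box1">
  <span class="catagory">Plant</span>
  <div class="Box2">
    <span class="sub-catagory">Trees</span>
    <div class="characters">
      <div class="font-medium">1.2</div>
      <div class="font-medium">1.6</div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# A single space as separator joins every text fragment onto one line;
# strip=True trims each fragment and drops whitespace-only ones.
one_line = soup.select_one(".Box1").get_text(" ", strip=True)
print(one_line)  # Plant Trees 1.2 1.6
```

Note that this only flattens the text; it does not record which values belong to which sub-category.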
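To get a more structured output, change the way you fetch your elements. Iterate over the sub-category spans, look backwards for the category each one belongs to, and look forwards for the div that holds its values: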
for e in soup.select('span.sub-catagory'):
    data.append({
        'cat': e.find_previous('span', {'class': 'catagory'}).text,
        'subcat': e.text,
        'characters': list(e.find_next('div').stripped_strings)
    })
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

data = []
for e in soup.select('span.sub-catagory'):
    data.append({
        # nearest preceding category span
        'cat': e.find_previous('span', {'class': 'catagory'}).text,
        'subcat': e.text,
        # texts of the first div that follows the sub-category span
        'characters': list(e.find_next('div').stripped_strings)
    })

data
[{'cat': 'Plant',
'subcat': 'Trees',
'characters': ['1.2', '1.6', '1.7', '1.8', '1.9', '1.4']},
{'cat': 'Plant',
'subcat': 'Flowers',
'characters': ['2.2', '3.6', '4.7', '5.8', '6.9', '7.4']},
{'cat': 'animals',
'subcat': 'human',
'characters': ['7.2', '9.6', '4.7', '3.8', '6.9', '9.4']},
{'cat': 'animals',
'subcat': 'dog',
'characters': ['4.2', '5.6', '6.7', '1.8', '3.9', '8.4']},
{'cat': 'non-living',
'subcat': 'rock',
'characters': ['1.2', '1.6', '4.7', '6.8', '1.9', '0.4']},
{'cat': 'non-living',
'subcat': 'stars',
'characters': ['3.2', '5.6', '2.7', '4.8', '1.9', '2.4']}]
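If a nested mapping with numeric values is more convenient than a flat list of records, one possible variation is to walk the siblings explicitly. This is only a sketch under assumptions: it relies on each category span being followed by a sibling div with class Box2, and each sub-category span by a sibling div with class characters, as in the question's markup. The reduced html_doc and the name nested are illustrative, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Reduced stand-in for the html_doc in the question (assumption for brevity).
html_doc = """
<div class="Box1">
  <span class="catagory">Plant</span>
  <div class="Box2">
    <span class="sub-catagory">Trees</span>
    <div class="characters">
      <div class="font-medium">1.2</div>
      <div class="font-medium">1.6</div>
    </div>
    <span class="sub-catagory">Flowers</span>
    <div class="characters">
      <div class="font-medium">2.2</div>
    </div>
  </div>
  <span class="catagory">animals</span>
  <div class="Box2">
    <span class="sub-catagory">dog</span>
    <div class="characters">
      <div class="font-medium">4.2</div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

nested = {}
for cat in soup.select("span.catagory"):
    # The Box2 div that directly follows this category span as a sibling.
    box = cat.find_next_sibling("div", class_="Box2")
    subcats = {}
    for sub in box.select("span.sub-catagory"):
        # The characters div that follows this sub-category span as a sibling.
        chars = sub.find_next_sibling("div", class_="characters")
        subcats[sub.get_text(strip=True)] = [float(s) for s in chars.stripped_strings]
    nested[cat.get_text(strip=True)] = subcats

print(nested)
# {'Plant': {'Trees': [1.2, 1.6], 'Flowers': [2.2]}, 'animals': {'dog': [4.2]}}
```

Matching on sibling class names is a little stricter than find_next('div'), so it would fail loudly (with an AttributeError on None) if the page structure differs from what is assumed here.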