python, html, beautifulsoup, urllib2

findChildren() method is storing two of the same child rather than one


From a webpage I am opening with urllib2 and scraping with BeautifulSoup, I am trying to store specific text from within the webpage.

Before you see the code, here is a link to a screenshot of the HTML from the webpage, so that you can see how I am using BeautifulSoup's find function:

HTML from webpage

And finally, here is the code I am using:

from BeautifulSoup import BeautifulSoup
import urllib2

url = 'http://www.sciencekids.co.nz/sciencefacts/animals/bird.html'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

ul = soup.find('ul', {'class': 'style33'})
children = ul.findChildren()
for child in children:
    print child.text

And here is the output where my problem lies:

Birds have feathers, wings, lay eggs and are warm blooded.
Birds have feathers, wings, lay eggs and are warm blooded.
There are around 10000 different species of birds worldwide.
There are around 10000 different species of birds worldwide.
The Ostrich is the largest bird in the world. It also lays the largest eggs and has the  fastest maximum running speed (97 kph).
The Ostrich is the largest bird in the world. It also lays the largest eggs and has the  fastest maximum running speed (97 kph).
Scientists believe that birds evolved from theropod dinosaurs.
Scientists believe that birds evolved from theropod dinosaurs.
Birds have hollow bones which help them fly.
Birds have hollow bones which help them fly.
Some bird species are intelligent enough to create and use tools.
Some bird species are intelligent enough to create and use tools.
The chicken is the most common species of bird found in the world.
The chicken is the most common species of bird found in the world.
Kiwis are endangered, flightless birds that live in New Zealand. They lay the largest eggs relative to their body size of any bird in the world.
Kiwis are endangered, flightless birds that live in New Zealand. They lay the largest eggs relative to their body size of any bird in the world.
Hummingbirds can fly backwards.
Hummingbirds can fly backwards.
The Bee Hummingbird is the smallest living bird in the world, with a length of just 5 cm (2 in).
The Bee Hummingbird is the smallest living bird in the world, with a length of just 5 cm (2 in).
Around 20% of bird species migrate long distances every year.
Around 20% of bird species migrate long distances every year.
Homing pigeons are bred to find their way home from long distances away and have been used for thousands of years to carry messages.
Homing pigeons are bred to find their way home from long distances away and have been used for thousands of years to carry messages.

Is there something I am doing incorrectly in my code that produces two children where there should be only one? It would be easy to add some extra code to filter out the duplicates, but I'd rather do this the right way so that I only get each string once.


Solution

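  • Called with no arguments, ul.findChildren() returns every descendant tag of the <ul>, not only its direct children, so it returns each <li> and also the <p> nested inside it. Both elements contain the same text, so printing child.text for each one outputs every fact twice. To fix this, change children = ul.findChildren() to children = ul.findChildren("p") so that only the <p> elements are selected.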
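
The behaviour described above can be reproduced without fetching the page. The following is a minimal sketch, not the original poster's code: it assumes the newer bs4 package (where find_all is the current name and findChildren is, as far as I know, kept only as a legacy alias), and the html string is a hypothetical stand-in for the real markup, assuming each <li> wraps a single <p>.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page markup (assumption: each <li> wraps one <p>).
html = (
    '<ul class="style33">'
    "<li><p>Birds have hollow bones which help them fly.</p></li>"
    "<li><p>Hummingbirds can fly backwards.</p></li>"
    "</ul>"
)

soup = BeautifulSoup(html, "html.parser")
ul = soup.find("ul", {"class": "style33"})

# With no arguments, every descendant tag is returned, so each <li> is
# followed by the <p> nested inside it, and both carry the same text.
all_tags = [tag.name for tag in ul.find_all()]

# Restricting the search to <p> gives exactly one tag per fact.
facts = [p.get_text(strip=True) for p in ul.find_all("p")]

# Alternative: keep only the direct <li> children of the <ul>.
direct = [li.get_text(strip=True) for li in ul.find_all("li", recursive=False)]

print(all_tags)
for fact in facts:
    print(fact)
```

Filtering by tag name is the smallest change to the original code. Passing recursive=False instead limits the search to direct children, which may be preferable if the list items ever contain more than one nested tag.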