Tags: python, beautifulsoup, web-crawler, html-parsing

Extract all <p> from HTML with BeautifulSoup


I have not found a solution here on Stack Overflow. My HTML snippet is:

<dl>
<dt class="abc">Test</dt><dd><dl>
    <dt>Part1</dt><dd><p>THISISWHATINEED<br /><a href="anyurl" target="">12334</a><br /><a href="anyurl" target="">abcdef</a></p></dd>
    <dt>Part2</dt><dd><p>THISISWHATINEED2<br /><a href="anyurl" target="">12334</a><br /><a href="anyurl" target="">abcdef</a></p></dd>
<dt class="abc">Test2</dt><dd><dl>
    <dt>Part3</dt><dd><p>THISISWHATINEED3<br /><a href="anyurl" target="">12334</a><br /><a href="anyurl" target="">abcdef</a></p></dd>
    <dt>Part4</dt><dd><p>THISISWHATINEED4<br /><a href="anyurl" target="">12334</a><br /><a href="anyurl" target="">abcdef</a></p></dd>

So how do I get all the <p> elements that belong to, for example, <dt class="abc">Test</dt><dd><dl>? I tried d1.find_all("dt"), but that misses the <p> elements. I don't understand how to get at the children. Ideally I would iterate over the <dt> elements and then, inside each one, over the <p> elements of, for example, "Test" (the first part). How do I do that? Do you have any tips or ideas?

What I already tried:

        d1 = soup.find_all("dl")
        for child in d1.children:
            print(child)
     

I also tried a lot of other things that I no longer remember.

Another approach that works quite well:

            for child in d1.children:
                if child.string is not None:
                    continue
                xx = len(child.find_all("p"))

Thanks!

Greetings Nick


Solution

  • Try using the adjacent sibling (+) CSS combinator, which selects an element that immediately follows another one.

    To use a CSS selector, use the .select() method instead of find_all().

    In your example:

    for tag in soup.select(".abc + dd dt + dd p"):
        print(tag.contents[0])
    
    • .abc matches by class name, so replace abc with the actual class
    • Since the <p> tag contains several child nodes (the text plus <br> and <a> tags), use .contents[0] to get the leading text

    Output:

    THISISWHATINEED
    THISISWHATINEED2
    THISISWHATINEED3
    THISISWHATINEED4
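
The question also asks how to iterate over each <dt class="abc"> heading and then over the <p> elements that belong to it. A minimal sketch of one way to do that follows. It assumes a well-formed variant of the markup (the closing </dl> and </dd> tags are added and the links shortened, which is not how the original snippet looks) and uses the built-in html.parser. The name grouped is illustrative, and stripped_strings is swapped in for .contents[0] so that a <p> without leading text does not raise an error.

```python
from bs4 import BeautifulSoup

# Well-formed variant of the snippet from the question. The closing tags
# and shortened links are assumptions, not the original markup.
html = """
<dl>
<dt class="abc">Test</dt><dd><dl>
    <dt>Part1</dt><dd><p>THISISWHATINEED<br /><a href="anyurl">12334</a></p></dd>
    <dt>Part2</dt><dd><p>THISISWHATINEED2<br /><a href="anyurl">12334</a></p></dd>
</dl></dd>
<dt class="abc">Test2</dt><dd><dl>
    <dt>Part3</dt><dd><p>THISISWHATINEED3<br /><a href="anyurl">12334</a></p></dd>
    <dt>Part4</dt><dd><p>THISISWHATINEED4<br /><a href="anyurl">12334</a></p></dd>
</dl></dd>
</dl>
"""

soup = BeautifulSoup(html, "html.parser")

# Map each heading's text to the leading text of every <p> in its <dd>.
grouped = {}
for heading in soup.select("dt.abc"):
    # The <dd> that directly follows the heading holds the nested list.
    body = heading.find_next_sibling("dd")
    if body is None:
        continue
    grouped[heading.get_text(strip=True)] = [
        # First non-empty text fragment of the <p>, or "" if there is none.
        next(p.stripped_strings, "")
        for p in body.find_all("p")
    ]

for name, texts in grouped.items():
    print(name, texts)
```

With markup that is not well-formed, as in the original snippet, the parser may nest the second group inside the first, so the grouping can differ. Closing the tags or trying a more lenient parser is worth checking first.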