python beautifulsoup web-crawler html-parsing

Extract all <p> from HTML with BeautifulSoup

I have not found a solution here on Stack Overflow. My HTML snippet is:

<dl>
<dt class="abc">Test</dt><dd><dl>
    <dt>Part1</dt><dd><p>THISISWHATINEED<br /><a href="anyurl" target="">12334</a><br /><a href="anyurl" target="">abcdef</a></p></dd>
    <dt>Part2</dt><dd><p>THISISWHATINEED2<br /><a href="anyurl" target="">12334</a><br /><a href="anyurl" target="">abcdef</a></p></dd>
<dt class="abc">Test2</dt><dd><dl>
    <dt>Part3</dt><dd><p>THISISWHATINEED3<br /><a href="anyurl" target="">12334</a><br /><a href="anyurl" target="">abcdef</a></p></dd>
    <dt>Part4</dt><dd><p>THISISWHATINEED4<br /><a href="anyurl" target="">12334</a><br /><a href="anyurl" target="">abcdef</a></p></dd>

So how do I get all the <p> elements that belong to, for example, <dt class="abc">Test</dt><dd><dl>? I tried d1.find_all("dt"), but then I am missing the <p> elements. I don't understand how to get the children. Ideally I would iterate over the <dt> elements and then, inside each one, over the <p> elements of, for example, "Test" (the first part). How do I do that? Do you have any tips or ideas?

What I already tried:

        d1 = soup.find_all("dl")
        for child in d1.children:
            print(child)

And a lot of other things that I can't remember anymore.

Another approach that works quite well:

            for child in d1.children:
                if child.string is not None:
                    continue
                if child.string is None:
                    xx = len(child.find_all("p"))

Thanks!

Greetings Nick

Solution

Try using the adjacent sibling (+) CSS selector, which selects an element that immediately follows another one.

To use a CSS selector, use the .select() method instead of find_all().
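
As a minimal sketch of the difference, assuming a tiny made-up document (the HTML string and the variable names below are only for illustration), the + selector picks up just the <dd> that directly follows the .abc element, while find_all("dd") returns every <dd>:

```python
from bs4 import BeautifulSoup

# Hypothetical mini document, only for illustration.
html = "<dl><dt class='abc'>Test</dt><dd>first</dd><dd>second</dd></dl>"
soup = BeautifulSoup(html, "html.parser")

# ".abc + dd" matches only the <dd> directly after the element with class "abc".
adjacent = soup.select(".abc + dd")
print([tag.get_text() for tag in adjacent])

# find_all("dd") matches every <dd>, regardless of what precedes it.
every_dd = soup.find_all("dd")
print([tag.get_text() for tag in every_dd])
```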

In your example:

for tag in soup.select(".abc + dd dt + dd p"):
    print(tag.contents[0])

.abc is the class name, so replace abc with the actual class.
Since the <p> tag contains several child nodes (the text, <br /> and <a> tags), use .contents[0] to get the first one, which is the desired text.

Output:

THISISWHATINEED
THISISWHATINEED2
THISISWHATINEED3
THISISWHATINEED4
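
If you also want to know which <dt class="abc"> each text belongs to, one possible approach is to loop over the headings and search only inside the <dd> that follows each one. This is a rough sketch, not the only way to do it. It assumes the snippet is well-formed (the closing </dl> and </dd> tags are added here, and the links are shortened), and the groups dictionary is just a name chosen for illustration:

```python
from bs4 import BeautifulSoup

# Well-formed version of the snippet; the closing tags are an assumption.
html = """
<dl>
  <dt class="abc">Test</dt>
  <dd><dl>
    <dt>Part1</dt><dd><p>THISISWHATINEED<br /><a href="anyurl">12334</a></p></dd>
    <dt>Part2</dt><dd><p>THISISWHATINEED2<br /><a href="anyurl">12334</a></p></dd>
  </dl></dd>
  <dt class="abc">Test2</dt>
  <dd><dl>
    <dt>Part3</dt><dd><p>THISISWHATINEED3<br /><a href="anyurl">12334</a></p></dd>
    <dt>Part4</dt><dd><p>THISISWHATINEED4<br /><a href="anyurl">12334</a></p></dd>
  </dl></dd>
</dl>
"""
soup = BeautifulSoup(html, "html.parser")

groups = {}
for heading in soup.select("dt.abc"):
    # The <dd> that follows the heading holds the nested <dl> for this group.
    section = heading.find_next_sibling("dd")
    if section is None:
        continue
    # Collect the leading text node of every <p> inside this group only.
    groups[heading.get_text(strip=True)] = [
        p.contents[0].strip() for p in section.select("dt + dd p")
    ]

for name, texts in groups.items():
    print(name, texts)
```

Calling select() on section instead of soup limits the search to that group, so the texts of "Test" and "Test2" stay separate.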
