
How to filter tags without an attribute in the find_all() function in BeautifulSoup?


Below is the simple HTML source I'm working with:

<html>
<head>
<title>Welcome to the comments assignment from www.py4e.com</title>
</head>
<body>
<h1>This file contains the actual data for your assignment - good luck!</h1>

<table border="2">
<tr>
<td>Name</td><td>Comments</td>
</tr>
<tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
<tr><td>Machaela</td><td><span class="comments">100</span></td></tr>
<tr><td>Rhoan</td><td><span class="comments">99</span></td></tr>

Below is my code, which tries to get the <td>Melodie</td> line:

from bs4 import BeautifulSoup

html='the HTML source shown above'

soup=BeautifulSoup(html,'html.parser')

# Print every <td> tag, followed by a separator line
for tag in soup.find_all('td'):
    print(tag)
    print('----')

#Result:
#===============================================================================
# <td>Name</td>
# ----
# <td>Comments</td>
# ----
# <td>Melodie</td>
# ----
# <td><span class="comments">100</span></td>
# ----
# <td>Machaela</td>
# ----
# <td><span class="comments">100</span></td>
# ----
# <td>Rhoan</td>
# ----
#.........
#===============================================================================

Now I want to get only the <td>name</td> lines, not the lines containing 'span' and 'class'. I tried two filters, soup.find_all('td' and not 'span') and soup.find_all('td', attrs={'class':None}), but neither works. I know there are other ways, but I want to use a filter in soup.find_all(). My expected output (my final goal is to get each person's name between the two <td> tags):

# <td>Name</td>
# ----
# <td>Comments</td>
# ----
# <td>Melodie</td>
# ----
# <td>Machaela</td>
# ----
# <td>Rhoan</td>
# ----

Solution

  • Select your elements via CSS selectors, e.g. by nesting the pseudo-classes :has() and :not():

    soup.select('td:not(:has(span))')
    

    or

    soup.select('td:not(:has(.comments))')
    

    Example

    import urllib.request
    from bs4 import BeautifulSoup

    html=urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_1430669.html').read()
    
    soup=BeautifulSoup(html,'html.parser')
    
    for e in soup.select('td:not(:has(span))'):
        print(e)
    

    Output

    <td>Name</td>
    <td>Comments</td>
    <td>Melodie</td>
    <td>Machaela</td>
    <td>Rhoan</td>
    <td>Murrough</td>
    <td>Lilygrace</td>
    ...
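
Since the question asks for a filter inside find_all() itself, here is a minimal sketch of one way that could look. find_all() accepts a function that receives each tag and returns True or False. The helper name td_without_span and the trimmed HTML string are assumptions made for illustration (the closing table tag is added here so the snippet stands alone). As for the two attempts in the question: 'td' and not 'span' is evaluated by Python to False before find_all() ever sees it, and the class attribute sits on the <span>, not on the <td>, so a class filter on <td> cannot tell the cells apart.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question (closing tags added here).
html = """
<table border="2">
<tr><td>Name</td><td>Comments</td></tr>
<tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
<tr><td>Machaela</td><td><span class="comments">100</span></td></tr>
<tr><td>Rhoan</td><td><span class="comments">99</span></td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')


def td_without_span(tag):
    """Hypothetical helper: True for <td> tags with no <span> descendant."""
    return tag.name == 'td' and tag.find('span') is None


# find_all() calls the function once per tag and keeps the tags it accepts.
cells = soup.find_all(td_without_span)
for cell in cells:
    print(cell)
    print('----')

# Text between the <td> tags; the two header cells are still included.
names = [cell.get_text(strip=True) for cell in cells]
```

If only the people's names are wanted, the first two entries of names (the header cells) could be sliced off with names[2:].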