python web-crawler selector

Can't use a CSS selector to get data in Python


Hi, I'd like to get the movie titles from this website:


import requests
from bs4 import BeautifulSoup

url = "https://www.the-numbers.com/market/" + "2019" + "/top-grossing-movies"
raw = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
html = BeautifulSoup(raw.text, "html.parser")
movie_list = html.select("#page_filling_chart > table > tbody > tr > td > b > a")
for movie in movie_list:
    print(movie.text)

I get a 200 response and have no problem crawling other information, but the problem is with the variable movie_list.

When I print(movie_list), I get just an empty list, which I assume means my selector is wrong.


Solution

  • If you replace:

    movie_list = html.select("#page_filling_chart > table > tbody > tr > td > b > a")
    

    With:

    movie_list = html.select("#page_filling_chart table tr > td > b > a")
    

    You get what I think you're looking for. The primary change here is replacing child selectors (parent > child) with descendant selectors (ancestor descendant), which are a lot more forgiving about what the intervening content looks like. The new selector also drops tbody, which does not appear in the parsed document.
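
To make the difference between the two combinators concrete, here is a minimal sketch. The markup, the `#chart` id, and the movie titles are made up for illustration and are not taken from the real page; the only assumption is that some extra element sits between the container and the table.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (not the real page): a wrapper <div> sits between
# the container and the table, so the table is not a direct child.
SAMPLE = """
<div id="chart">
  <div class="wrapper">
    <table>
      <tr><td><b><a href="/m/1">Movie One</a></b></td></tr>
      <tr><td><b><a href="/m/2">Movie Two</a></b></td></tr>
    </table>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# Child combinator: every step must be a direct parent/child pair.
child_only = soup.select("#chart > table > tr > td > b > a")

# Descendant combinator for the first two steps: any depth is accepted.
descendant = soup.select("#chart table tr > td > b > a")

child_titles = [a.get_text(strip=True) for a in child_only]
descendant_titles = [a.get_text(strip=True) for a in descendant]

print(child_titles)
print(descendant_titles)
```

The child-only selector finds nothing because of the wrapper element, while the descendant version still reaches the links.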


    Update: this is interesting. Your choice of BeautifulSoup parser seems to lead to different behavior.

    Compare:

    >>> html = BeautifulSoup(raw.text, 'html.parser')
    >>> html.select('#page_filling_chart > table')
    []
    

    With:

    >>> html = BeautifulSoup(raw.text, 'lxml')
    >>> html.select('#page_filling_chart > table')
    [<table>
    <tr><th>Rank</th><th>Movie</th><th>Release<br/>Date</th><th>Distributor</th><th>Genre</th><th>2019 Gross</th><th>Tickets Sold</th></tr>
    <tr>
    [...]
    

    In fact, using the lxml parser you can almost use your original selector. This works:

    html.select("#page_filling_chart > table > tr > td > b > a")
    

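    Note that tbody is gone from the selector: the page source has no <tbody> tags, and neither html.parser nor lxml adds one. Browser developer tools show a tbody only because the browser inserts it when building the DOM.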

    After experimenting for a bit, I found that you would have to rewrite your original query like this to make it work with html.parser:

    html.select("#page_filling_chart > p > p > p > p > p > table > tr > td > b > a")
    

    It looks like html.parser doesn't synthesize closing </p> tags when they are missing from the source, so each unclosed <p> ends up nested inside the previous one and the table sits several levels deeper than it appears in the source.
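
The nesting effect can be reproduced on a small scale. This is only a sketch: the snippet below is made-up markup with two unclosed <p> tags (the real page apparently has five), and the `#chart` id is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two <p> tags that are never closed.
SAMPLE = (
    '<div id="chart"><p>intro<p>note'
    '<table><tr><td><b><a href="/m/1">Movie One</a></b></td></tr></table>'
    '</div>'
)

soup = BeautifulSoup(SAMPLE, "html.parser")

# Walk upward from the table to see where html.parser placed it.
table = soup.find("table")
ancestors = [parent.name for parent in table.parents]
print(ancestors)

# The table is not a direct child of the container under html.parser...
direct = soup.select("#chart > table")

# ...but it is reachable through the chain of nested <p> elements.
nested = soup.select("#chart > p > p > table")

print(len(direct), len(nested))
```

Printing the ancestors of the element you are after is a quick way to check what structure a given parser actually produced before writing a strict child-combinator selector.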