python web-scraping beautifulsoup web-crawler

Crawl data from the IMDb Top 250 Movies chart


I need some help. I can't understand why my crawler returns only 25 movies instead of 250. My code:

import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
url = "https://www.imdb.com/chart/top/?ref_=nv_mv_250"
response = requests.get(url, headers = headers)

html_doc = response.content
soup = BeautifulSoup(html_doc, "html.parser")

ls = soup.find_all("div", class_="sc-b189961a-0 hBZnfJ cli-children")
print(len(ls))

The result is 25. The page at https://www.imdb.com/chart/top/?ref_=nv_mv_250 lists 250 movies, and I am using BeautifulSoup, so len(ls) should be 250. Please explain why this happens and how I can fix it so that I crawl the full list. Thank you very much!


Solution

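  • The HTML returned to requests most likely contains only the first 25 movies as rendered elements; the rest appear to be filled in by JavaScript, which requests does not run. Instead, extract the full list of movies from the page's JSON-LD (JSON for Linking Data) script element. It holds a JSON object from which the required information can be easily extracted.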

    import json
    import requests
    import pandas as pd
    from bs4 import BeautifulSoup
    
    # A browser-like User-Agent; without one the request may be rejected
    headers = {
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) Chrome/126.0.0.0 Safari/537.36',
        'accept': 'text/html',
    }
    
    response = requests.get('https://www.imdb.com/chart/top/?ref_=nv_mv_250', headers=headers)
    response.raise_for_status()  # stop early on an HTTP error
    
    soup = BeautifulSoup(response.text, "html.parser")
    
    # The JSON-LD script element holds the complete list of movies
    script = soup.select_one("script[type='application/ld+json']")
    
    data = json.loads(script.text)
    
    movies = []
    
    # Each entry wraps the movie details in an "item" object
    for movie in data["itemListElement"]:
        movies.append({k: movie["item"][k] for k in ["name", "url", "duration"]})
    
    movies = pd.DataFrame(movies)
    
    print(movies)
    

    Sample output:

                                                 name                                     url duration
    0                        The Shawshank Redemption   https://www.imdb.com/title/tt0111161/  PT2H22M
    1                                   The Godfather   https://www.imdb.com/title/tt0068646/  PT2H55M
    2                                 The Dark Knight   https://www.imdb.com/title/tt0468569/  PT2H32M
    3                           The Godfather Part II   https://www.imdb.com/title/tt0071562/  PT3H22M
    4                                    12 Angry Men   https://www.imdb.com/title/tt0050083/  PT1H36M
    ..                                            ...                                     ...      ...
    245                         It Happened One Night   https://www.imdb.com/title/tt0025316/  PT1H45M
    246                                       Aladdin   https://www.imdb.com/title/tt0103639/  PT1H30M
    247                                      Drishyam   https://www.imdb.com/title/tt4430212/  PT2H43M
    248                            Dances with Wolves   https://www.imdb.com/title/tt0099348/   PT3H1M
    249  Gekijôban Kimetsu no yaiba: Mugen Ressha hen  https://www.imdb.com/title/tt11032374/  PT1H57M
    
    [250 rows x 3 columns]
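
The duration column in the sample output holds strings such as PT2H22M, which look like ISO 8601 durations. If you would rather have plain minutes, here is a minimal sketch that assumes every value has the form PT, then optional hours, minutes and seconds. The helper name duration_to_minutes is hypothetical and not part of any library; values that do not match the assumed form come back as None.

```python
import re

# Assumed shape: "PT" followed by optional hours (H), minutes (M) and seconds (S)
_DURATION_RE = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def duration_to_minutes(value):
    """Convert a string such as 'PT2H22M' to whole minutes.

    Hypothetical helper: returns None when the value is missing or does
    not match the assumed pattern.
    """
    match = _DURATION_RE.match(value or "")
    if not match or not any(match.groups()):
        return None
    # Components that are absent from the string count as zero
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes + seconds // 60
```

With the DataFrame from the answer above, this could be applied as movies["minutes"] = movies["duration"].map(duration_to_minutes).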