
rvest::html_nodes returns a partial list (only a few items)


Using the rvest package, I am trying to scrape the names of actors/actresses from the IMDb page for the film JFK (https://www.imdb.com/title/tt0102138/fullcredits?ref_=tt_ql_1).

SelectorGadget reports that the selector matching every person's name is "td:nth-child(2)".

Here's the code I'm using.

        library(rvest)
        library(stringr)

        startFilm <- "tt0102138" #JFK
        personsNames <- c()
        pagePath <- paste("https://www.imdb.com/title/", startFilm, "/?ref_=nv_sr_1?ref_=nv_sr_1", sep = "")
        moviePage <- read_html(pagePath)
        personNodes <- html_nodes(moviePage, "td:nth-child(2)")
        personText <- html_text(personNodes)
        for (i in 1:length(personText)){
                actor <- (unlist(str_split(personText[i], "\n")))[2]
                personsNames[i] <- substring(actor, 2, nchar(actor))
        }
        personsNames

According to the website at https://www.imdb.com/title/tt0102138/fullcredits?ref_=tt_ql_1 this list should be fairly long.

Yet when I run the code I only get back 15 names.

[1] "Sally Kirkland"  "Anthony Ramirez" "Ray LePere"      "Steve Reed"      "Jodie Farber"    "Columbia Dubose"
[7] "Randy Means"     "Kevin Costner"   "Jay O. Sanders"  "E.J. Morris"     "Cheryl Penland"  "Jim Gough"
[13] "Perry R. Russo"  "Mike Longman"    "Edward Asner"

Why is the list of names truncated?

How should I adjust my code to get the full list of actors/actresses in the film?


Solution

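  • The list appears to be truncated because your code requests the film's main title page (https://www.imdb.com/title/tt0102138/?ref_=nv_sr_1?ref_=nv_sr_1), not the fullcredits page you linked, and the main page shows only part of the cast. If you just need actors/actresses, you can run the following code against the fullcredits page. It targets the photo cell of each cast row, where the image's alt attribute holds the name, so no string manipulation is needed.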

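    library(rvest)
    library(stringi)  # only needed for the data frame example further down

    # Each cast row has a photo cell whose <img> alt attribute holds the name
    read_html("https://www.imdb.com/title/tt0102138/fullcredits?ref_=tt_ql_1") %>%
      html_nodes("td.primary_photo") %>%
      html_nodes("img") %>%
      html_attr("alt")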
    
    #  [1] "Sally Kirkland"             "Anthony Ramirez"            "Ray LePere"                 "Steve Reed"                
    #  [5] "Jodie Farber"               "Columbia Dubose"            "Randy Means"                "Kevin Costner"  
    #  ...
    #[249] "Mark Edward Walters"        "Earl Warren"                "John B. Wells"              "Jim White"                 
    #[253] "Phillip L. Willis"          "Rosemary Willis"            "Louis Steven Witt"          "Angus G. Wynne III"
    

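    As a bonus, to build a data frame of the actors' names and their characters' names, you can try the following. It pairs the two vectors by position, so it assumes every cast row has both a photo cell and a character cell.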

    library(tibble)  # provides tibble(); rvest and stringi are loaded above

    # Read the page once and reuse it for both columns
    creditsPage <- read_html("https://www.imdb.com/title/tt0102138/fullcredits?ref_=tt_ql_1")

    mydf <- tibble(actors = creditsPage %>%
                     html_nodes("td.primary_photo") %>%
                     html_nodes("img") %>%
                     html_attr("alt"),
                   characters = creditsPage %>%
                     html_nodes(".character") %>%
                     html_text() %>%
                     stri_replace_all_regex(pattern = "\\n|\\s{2,}", replacement = ""))
    
    #  actors          characters                             
    #   <chr>           <chr>                                  
    # 1 Sally Kirkland  Rose Cheramie                          
    # 2 Anthony Ramirez Epileptic                              
    # 3 Ray LePere      Zapruder                               
    # 4 Steve Reed      John F. Kennedy - Double               
    # 5 Jodie Farber    Jackie Kennedy - Double(as Jodi Farber)
    # 6 Columbia Dubose Nellie Connally - Double               
    # 7 Randy Means     Gov. Connally - Double                 
    # 8 Kevin Costner   Jim Garrison                           
    # 9 Jay O. Sanders  Lou Ivon                               
    #10 E.J. Morris     Plaza Witness #1
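
The data frame above pairs two separately extracted vectors by position. A row-wise variant may be more robust, since it keeps each actor aligned with their character even when a row lacks one of the cells. The following is only a minimal sketch, not the method used above: `extract_cast` is a hypothetical helper, and the inline HTML is a simplified stand-in for the cast table (an assumption, not IMDb's actual markup). It relies on `html_node()` (singular) returning one result per input row, with NA where nothing matches.

```r
library(rvest)

# Hypothetical helper: one data frame row per table row that has a photo cell.
extract_cast <- function(page) {
  rows <- html_nodes(page, "tr")
  # html_node() yields exactly one result per row (NA when there is no match),
  # so the two vectors below always have the same length and order.
  actors <- rows %>% html_node("td.primary_photo img") %>% html_attr("alt")
  characters <- rows %>% html_node("td.character") %>% html_text() %>% trimws()
  cast <- data.frame(actor = actors, character = characters,
                     stringsAsFactors = FALSE)
  cast[!is.na(cast$actor), ]  # drop separator rows that have no photo cell
}

# Simplified, made-up markup used only to exercise the helper offline.
cast_html <- '
<table>
  <tr><td class="primary_photo"><a><img alt="Kevin Costner"></a></td>
      <td><a>Kevin Costner</a></td>
      <td class="character">
        Jim Garrison
      </td></tr>
  <tr><td class="primary_photo"><a><img alt="Jay O. Sanders"></a></td>
      <td><a>Jay O. Sanders</a></td>
      <td class="character">
        Lou Ivon
      </td></tr>
  <tr><td colspan="4">Rest of cast listed alphabetically:</td></tr>
</table>'

cast <- extract_cast(read_html(cast_html))
cast
```

To try this on the live page, one could pass `read_html(paste0("https://www.imdb.com/title/", startFilm, "/fullcredits"))` to the helper, assuming the page still uses the `td.primary_photo` and `.character` classes seen in the code above.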