Tags: r, web-scraping, rstudio, wiki

Web Scraping Wikipedia - string manipulation


I have managed to scrape this Wikipedia page Oscars Nominations and extract the table under "Nominees". I can get the table with the code below:

library(rvest)      # read_html(), html_nodes(), html_table()
library(magrittr)   # %>% pipe

wiki <- "https://en.wikipedia.org/wiki/89th_Academy_Awards"
text <- wiki %>% 
         read_html() %>% 
         html_nodes(xpath = '//*[@id="mw-content-text"]/table[3]') %>%  # pass the XPath via the xpath argument
         html_table()

This outputs a 'list' named 'text'.

test <- data.frame(one = unlist(text), stringsAsFactors = FALSE)  # flatten the list into a one-column data frame
row.names(test) <- NULL
test <- test[-16, ]              # drop the useless row
nw_lst <- strsplit(test, "\n")   # split each cell on line breaks

I put the results in a data frame, remove a useless row, and then 'strsplit' on the line break '\n', which outputs another, much cleaner list with 23 elements, one per Oscar category, with the nominated titles listed underneath. I then want to parse the list into two data frames: one for the Best Picture nominations and a second for all the other nominations.

oscr.bp <- data.frame(Best.Picture = unlist(nw_lst[[1]]), stringsAsFactors = FALSE)
oscr.bp <- as.data.frame(oscr.bp[-1, ], stringsAsFactors = FALSE)  # drop the category header row
colnames(oscr.bp) <- c("Best.Picture")
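
For the other nominations, a rough sketch of the second data frame I have in mind, assuming each remaining element of 'nw_lst' follows the same layout as the Best Picture one (the category name as its first entry, the nominations after it):

# Sketch only: stack the non-Best-Picture elements into one data frame
oscr.other <- do.call(rbind, lapply(nw_lst[-1], function(x) {
  data.frame(Category   = x[1],
             Nomination = x[-1],
             stringsAsFactors = FALSE)
}))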

So here is my issue: once I separate the nominations, I would like to clean up the text. For some reason, nothing I tried from the 'stringr' package removes all the unnecessary text and leaves only the movie title.

str_replace_all(oscr.bp$Best.Picture,pattern = "\n", replacement = " ") 
str_replace_all(oscr.bp$Best.Picture,pattern = "[\\^]", replacement = " ") 
str_replace_all(oscr.bp$Best.Picture,pattern = "\"", replacement = " ") 
str_replace_all(oscr.bp$Best.Picture,pattern = "\\s+", replacement = " ") 
str_trim(oscr.bp$Best.Picture,side = "both")
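
For reference, the chained version of the cleanup I am attempting looks like the sketch below, assigning the result back to the column, since 'str_replace_all' and 'str_trim' return a new vector rather than changing it in place:

library(stringr)

# Sketch: chain the replacements and assign the cleaned vector back to the column
oscr.bp$Best.Picture <- oscr.bp$Best.Picture %>%
  str_replace_all(pattern = "\n",    replacement = " ") %>%
  str_replace_all(pattern = "[\\^]", replacement = " ") %>%
  str_replace_all(pattern = "\"",    replacement = " ") %>%
  str_replace_all(pattern = "\\s+",  replacement = " ") %>%
  str_trim(side = "both")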

But when I inspect the structure of the df in my environment, click the blue arrow to see the vector classes, and hover over the chr vector, it shows weird shapes within the character vector and a |__truncated__ marker inside the string, which is not visible when I print the string in the console.

I just want to know the best way to clean these strings, or another way to get just the title names for each nomination from the HTML nodes under <ul> and <li>. I do not know much about HTML beyond looking through the source code and finding what I need with SelectorGadget.


Solution

  • Another approach is to target each individual <td> and then use the metadata available within it:

    library(rvest)
    library(tidyverse)
    
    pg <- read_html("https://en.wikipedia.org/wiki/89th_Academy_Awards")
    
    # first table after the "Nominees" heading; each <td> holds one award category
    html_nodes(pg, xpath=".//h2[span/@id = 'Nominees']/following-sibling::table[1]") %>%
      html_nodes("td") %>%
      map_df(function(x) {
        category <- html_nodes(x, "div") %>% html_text()  # category name lives in a <div>
        html_nodes(x, "li") %>%                           # one <li> per nominated film
          map_df(function(y) {
            html_nodes(y, "a") %>% html_attr("title") -> tmp
            movie <- tmp[1]      # first link is the film
            nominee <- tmp[-1]   # remaining links are the nominees
            tibble(movie = rep(movie, length(nominee)), nominee)
          }) %>%
          mutate(category = category)
      }) %>%
      select(category, movie, nominee)
    ## # A tibble: 236 × 3
    ##        category          movie           nominee
    ##           <chr>          <chr>             <chr>
    ## 1  Best Picture Arrival (film)        Shawn Levy
    ## 2  Best Picture Arrival (film)       David Linde
    ## 3  Best Picture  Fences (film)       Scott Rudin
    ## 4  Best Picture  Fences (film) Denzel Washington
    ## 5  Best Picture  Fences (film)        Todd Black
    ## 6  Best Picture  Hacksaw Ridge     Bill Mechanic
    ## 7  Best Picture  Hacksaw Ridge      David Permut
    ## 8  Best Picture Hidden Figures   Donna Gigliotti
    ## 9  Best Picture Hidden Figures     Peter Chernin
    ## 10 Best Picture Hidden Figures     Jenno Topping
    ## # ... with 226 more rows
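
  • From there, the two data frames the question asks for fall out with a simple filter. A sketch, assuming the combined tibble built above is saved to a hypothetical object named noms:

    # Sketch: `noms` is assumed to hold the combined tibble from the pipeline above
    oscr.bp    <- filter(noms, category == "Best Picture")   # Best Picture nominations
    oscr.other <- filter(noms, category != "Best Picture")   # everything else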