I have managed to scrape this Wikipedia page (Oscars Nominations) and extract the table under "Nominees". I can get the table with the code below:
wiki <- "https://en.wikipedia.org/wiki/89th_Academy_Awards"
text <- wiki %>%
read_html() %>%
html_nodes('//*[@id="mw-content-text"]/table[3]') %>%
html_table()
This outputs a list named 'text'.
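(As an aside, since html_table() returns a list of data frames, the table itself can also be pulled out directly; the variable name here is just my own choice, not part of my actual steps below.)

nominees_tbl <- text[[1]]   # first (and only) element of the list returned by html_table()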
test <- data.frame(one = unlist(text), stringsAsFactors = FALSE)  # flatten the list into a one-column data frame
row.names(test) <- NULL
test <- test[-16, ]               # drop the useless 16th row (this also drops test down to a character vector)
nw_lst <- strsplit(test, "\n")    # split each cell on line breaks
I put the results in a data frame, remove a useless row, and then strsplit() on the line break '\n'. That produces another list ('nw_lst') that is a lot cleaner: 23 elements, one per Oscar category, with the nominated titles listed below each category name. I then want to parse the list into two data frames, one for the Best Picture nominations and a second for all the other nominations.
oscr.bp <- data.frame(Best.Picture = unlist(nw_lst[[1]]), stringsAsFactors = FALSE)  # first element is the Best Picture block
oscr.bp <- as.data.frame(oscr.bp[-1, ], stringsAsFactors = FALSE)                    # drop the category header line
colnames(oscr.bp) <- c("Best.Picture")
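For reference, this is roughly how I planned to build the second data frame from the remaining list elements (the names here are my own, and I'm assuming each element's first line is the category name followed by the nominees):

oscr.other <- do.call(rbind, lapply(nw_lst[-1], function(blk) {
  data.frame(Category = blk[1],        # first line of each block is the category name
             Nominee  = blk[-1],       # remaining lines are the nominated titles/people
             stringsAsFactors = FALSE)
}))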
So here is my issue: once I have separated the nominations, I would like to clean up the text, but nothing I try from the 'stringr' package strips out all of the unnecessary text and leaves just the movie title.
str_replace_all(oscr.bp$Best.Picture,pattern = "\n", replacement = " ")
str_replace_all(oscr.bp$Best.Picture,pattern = "[\\^]", replacement = " ")
str_replace_all(oscr.bp$Best.Picture,pattern = "\"", replacement = " ")
str_replace_all(oscr.bp$Best.Picture,pattern = "\\s+", replacement = " ")
str_trim(oscr.bp$Best.Picture,side = "both")
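One thing I noticed while writing this up: the calls above only print their results, since nothing is assigned back to the column. Chaining them and assigning the result (same patterns as above; the pipe comes from magrittr) would look roughly like this:

library(stringr)
library(magrittr)

oscr.bp$Best.Picture <- oscr.bp$Best.Picture %>%
  str_replace_all(pattern = "\n", replacement = " ") %>%      # line breaks -> spaces
  str_replace_all(pattern = "[\\^]", replacement = " ") %>%   # stray carets
  str_replace_all(pattern = "\"", replacement = " ") %>%      # literal double quotes
  str_replace_all(pattern = "\\s+", replacement = " ") %>%    # collapse repeated whitespace
  str_trim(side = "both")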
But when I inspect the structure of the df in my environment pane, click the blue arrow to see the vector classes, and hover the mouse over the chr vector, it shows odd shapes inside the character vector and a |__truncated__ marker within the string, even though none of that is visible when I inspect the string in the console.
I just want to know the best way to go about cleaning these strings, or another way to get just the title names for each nomination from the <ul> and <li> nodes in the HTML. I don't know much about basic HTML beyond looking through the source code and finding what I need with SelectorGadget.
Another approach is to target each individual <td> and then use the metadata available:
library(rvest)
library(tidyverse)

pg <- read_html("https://en.wikipedia.org/wiki/89th_Academy_Awards")

# find the table that immediately follows the "Nominees" heading, then walk each cell
html_nodes(pg, xpath = ".//h2[span/@id = 'Nominees']/following-sibling::table[1]") %>%
  html_nodes("td") %>%
  map_df(function(x) {
    category <- html_nodes(x, "div") %>% html_text()   # the category name sits in a <div> at the top of the cell
    html_nodes(x, "li") %>%                            # one <li> per nominated film
      map_df(function(y) {
        html_nodes(y, "a") %>% html_attr("title") -> tmp
        movie <- tmp[1]       # first link title is the film
        nominee <- tmp[-1]    # the rest are the nominated people
        data_frame(movie = rep(movie, length(nominee)), nominee)
      }) %>%
      mutate(category = category)
  }) %>%
  select(category, movie, nominee)
## # A tibble: 236 × 3
## category movie nominee
## <chr> <chr> <chr>
## 1 Best Picture Arrival (film) Shawn Levy
## 2 Best Picture Arrival (film) David Linde
## 3 Best Picture Fences (film) Scott Rudin
## 4 Best Picture Fences (film) Denzel Washington
## 5 Best Picture Fences (film) Todd Black
## 6 Best Picture Hacksaw Ridge Bill Mechanic
## 7 Best Picture Hacksaw Ridge David Permut
## 8 Best Picture Hidden Figures Donna Gigliotti
## 9 Best Picture Hidden Figures Peter Chernin
## 10 Best Picture Hidden Figures Jenno Topping
## # ... with 226 more rows
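If you still want the two separate data frames from the question (Best Picture vs. everything else), that's now just a filter on the category column; here I'm assuming the pipeline above gets assigned to a variable, say nominees:

# assuming the result of the pipeline above was assigned to `nominees`
best_picture <- filter(nominees, category == "Best Picture")
other_noms   <- filter(nominees, category != "Best Picture")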