Tags: r, web-scraping, rvest, quanteda

Scrape Body of News Articles and Place into Data Frame


I'm attempting to scrape news articles and place them into a data frame so I can analyze the text using quanteda. So far, I've been able to scrape the title, author, date, and URL of each article and place them into a data frame. I've also been able to scrape articles across several pages. How can I "go into" each article to "get" the body text and add it to the data frame as well?

library(rvest)
library(tidyverse)

get_articles <- function(n_articles) {
  page <- paste0("https://www.theroot.com/news/criminal-justice",
                 "?startIndex=",
                 n_articles) %>%
    read_html()
  
  tibble(
    title = page %>%
      html_elements(".aoiLP .js_link") %>%
      html_text2(),
    author = page %>%
      html_elements(".llHfhX .js_link , .permalink-bylineprop") %>%
      html_text2(),
    date = page %>%
      html_elements(".js_meta-time") %>%
      html_text2(),
    url = page %>%
      html_elements(".aoiLP .js_link") %>%
      html_attr("href")
  )
}

df <- map_dfr(seq(0, 200, by = 20), get_articles)

I've written some code to do this for one article, but I'm unsure how to replicate it using the function I already have.

get_article <- function(article_link) {
  # Read the article page, then collapse its body paragraphs into one string
  article_page <- read_html(article_link)
  article_page %>%
    html_nodes(".bOfvBY") %>%
    html_text() %>%
    paste(collapse = ",")
}

get_article("https://www.theroot.com/mississippi-man-arrested-for-attempting-to-hit-black-ch-1849342160")

Solution

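    # slice(1:10) limits this demonstration to the first ten articles.
    # For each URL, read the article page and collapse its body paragraphs into one string.
    df %>%
      slice(1:10) %>%
      mutate(content = map(url, ~ read_html(.x) %>%
                             html_elements(".bOfvBY") %>%
                             html_text2() %>%
                             paste(collapse = ","))) %>%
      unnest(content)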
    
    # A tibble: 10 × 5
       title                                                                              author date  url   content
       <chr>                                                                              <chr>  <chr> <chr> <chr>  
     1 Man Charged in Ahmaud Arbery Murder Asks for Leniency Ahead of Sentencing          Kalyn… Toda… http… "Greg …
     2 Mississippi Man Arrested for Attempting to Hit Black Children with Car             Kalyn… 7/28… http… "White…
     3 2 Blacks Girls Charged With Hate Crimes for Attacking Woman on MTA Bus             Kalyn… 7/27… http… "Two B…
     4 [Updated] Flashy Bishop Whitehead of Brooklyn Reenacts Getting Robbed at Gunpoint  Kalyn… 7/25… http… "Bisho…
     5 Georgia Gov. Brian Kemp To Testify On Trump Probe To Overturn 2020 Election        Murja… 7/25… http… "Profe…
     6 Florida To Allow Military Veterans Teach In Schools With No Degree                 Murja… 7/23… http… "Flori…
     7 One of George Floyd’s Killers Gets Sentenced to Only 2 Years In Prison             Kalyn… 7/21… http… "Forme…
     8 Judge Finds Enough Evidence to Pursue Criminal Charges Against Elijah McClain’s K… Kalyn… 7/20… http… "A jud…
     9 Indiana Man Arrested in Connection to Black Girl’s Disappearance.                  Kalyn… 7/19… http… "Karen…
    10 “This is Not a George Floyd Situation!” Says Woman who Called Cops on Andrew Tekl… Kalyn… 7/19… http… "The t…
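
When running this over all of the scraped URLs rather than the first ten, a single failed request would stop the whole pipeline. Below is a minimal sketch of a more defensive variant, not a definitive implementation. The helper names extract_body, get_body and safe_get_body, and the one-second delay, are assumptions made for illustration. The ".bOfvBY" selector is the one from the question and may stop matching if the site changes its markup. The last lines check the extraction step on a small inline document, so they run without network access.

```r
library(rvest)
library(purrr)

# Hypothetical helper: pull the body text out of an already-parsed page.
# ".bOfvBY" is the selector from the question (assumed to still match the site).
extract_body <- function(page, selector = ".bOfvBY") {
  page %>%
    html_elements(selector) %>%
    html_text2() %>%
    paste(collapse = " ")
}

# Hypothetical helper: pause briefly, download one article, return its body.
# The default delay of one second is an arbitrary, polite choice.
get_body <- function(url, delay = 1) {
  Sys.sleep(delay)
  extract_body(read_html(url))
}

# possibly() makes the function return NA instead of raising an error
# when a page cannot be read.
safe_get_body <- possibly(get_body, otherwise = NA_character_)

# Offline check on an inline document: only the two matching
# paragraphs should end up in the result.
demo_page <- minimal_html('
  <p class="bOfvBY">First paragraph.</p>
  <p class="bOfvBY">Second paragraph.</p>
  <p class="other">Not part of the body.</p>
')
demo_body <- extract_body(demo_page)
```

With the data frame from the question, this could be applied as df %>% mutate(content = map_chr(url, safe_get_body)). Because map_chr() already returns a character vector, no unnest() step is needed. The resulting column can then be passed to quanteda, for example with corpus(df, text_field = "content").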