Search code examples
rweb-scrapingrvest

Webscarping with rvest - Get table and span text


I looking to get the table at this link (https://clinicaltrials.gov/ct2/history/NCT04658186 ) along with the hover text on some rows enter image description here.

The result i want is to create a data frame , so that the hover text is a column on same row as its on webpage. Tried the code below where i can get the table and span text separately, unable to figure out how to merge this togeather.

library(dplyr)
library(rvest)

 # Set the URL of the webpage containing the table
  url <- "https://clinicaltrials.gov/ct2/history/NCT04658186"
  
  # Read the HTML code from the webpage
  page <- read_html(url)
  
  # Use html_table() to extract the table data
  table_data <- page %>%
    html_table(fill = TRUE) %>%
    .[[1]] # Select the first table on the page
  
  # Use html_nodes() and html_text() to extract the text from span elements within the table
  span_text <- page %>% html_nodes("span") %>% 
    html_attr("title") %>% data.frame() 

Thanks for any help in advance.


Solution

  • In such case, we can cycle through a list of elements (i.e. table rows) and extract certain bits from each item. With this approach, we'll end up with a correctly aligned list or vector that can be bound to previously extracted table:

    library(dplyr)
    library(rvest)
    library(purrr)
    
    # Set the URL of the webpage containing the table
    url <- "https://clinicaltrials.gov/ct2/history/NCT04658186"
    
    # Read the HTML code from the webpage
    page <- read_html(url)
    
    table_data <- page %>%
      # selecting the target table first to get a single table from html_table()
      html_element("table") %>% 
      html_table(fill = TRUE)
    
    # select all table rows, and cycle through those with map_chr(), 
    # map_chr returns character vecotor of the same length as 
    # input list (number of <tr> elements)
    recr_stat <- page %>% html_elements("tbody tr") %>% 
      map_chr (\(tr) html_element(tr, "span.recruitmentStatus") %>% html_attr("title"))
    
    # bind to table:
    bind_cols(table_data, `Recruitment Status` = recr_stat) %>% 
      relocate(`Recruitment Status`, .before = Changes)
    #> # A tibble: 58 × 6
    #>    Version A     B     `Submitted Date` `Recruitment Status`             Changes
    #>      <int> <lgl> <lgl> <chr>            <chr>                            <chr>  
    #>  1       1 NA    NA    December 1, 2020 <NA>                             None (…
    #>  2       2 NA    NA    January 12, 2021 Not yet recruiting --> Recruiti… Recrui…
    #>  3       3 NA    NA    January 29, 2021 <NA>                             Contac…
    #>  4       4 NA    NA    February 4, 2021 <NA>                             Study …
    #>  5       5 NA    NA    March 4, 2021    <NA>                             Study …
    #>  6       6 NA    NA    March 18, 2021   <NA>                             Contac…
    #>  7       7 NA    NA    April 15, 2021   <NA>                             Study …
    #>  8       8 NA    NA    May 14, 2021     <NA>                             Study …
    #>  9       9 NA    NA    May 27, 2021     <NA>                             Contac…
    #> 10      10 NA    NA    June 10, 2021    <NA>                             Study …
    #> # ℹ 48 more rows
    

    For a more robust approach, we can skip html_table() and extract all required details from every element (here: tr) ourselves. This also works for tableless designs where tabular data is presented through lists or divs, for example.

    results <- page %>% html_elements("tbody tr") %>% 
      map(\(tr) list(
        version  = html_element(tr, "td[headers='VersionNumber']") %>% html_text(),
        date     = html_element(tr, "td[headers='VersionDate']") %>% html_text(),
        recrstat = html_element(tr, "td[headers='Changes'] span.recruitmentStatus") %>% html_attr("title"),
        changes  = html_element(tr, "td[headers='Changes']") %>% html_text()
        )) %>% 
      bind_rows()
    
    results %>% 
      mutate(version = as.integer(version),
             date = lubridate::mdy(date))
    #> # A tibble: 58 × 4
    #>    version date       recrstat                          changes                 
    #>      <int> <date>     <chr>                             <chr>                   
    #>  1       1 2020-12-01 <NA>                              None (earliest Version …
    #>  2       2 2021-01-12 Not yet recruiting --> Recruiting Recruitment Status, Stu…
    #>  3       3 2021-01-29 <NA>                              Contacts/Locations and …
    #>  4       4 2021-02-04 <NA>                              Study Status and Contac…
    #>  5       5 2021-03-04 <NA>                              Study Status and Contac…
    #>  6       6 2021-03-18 <NA>                              Contacts/Locations and …
    #>  7       7 2021-04-15 <NA>                              Study Status and Contac…
    #>  8       8 2021-05-14 <NA>                              Study Status and Contac…
    #>  9       9 2021-05-27 <NA>                              Contacts/Locations and …
    #> 10      10 2021-10-20 <NA>                              Study Status and Contac…
    #> # ℹ 48 more rows
    

    Created on 2023-06-15 with reprex v2.0.2