
Parsing a table and URLs in R with rvest


Sorry for one more scraping question.

I need data from this table: http://rspp.ru/tables/non-financial-reports-library/ It contains non-financial reports of Russian companies. It is legal to scrape it. I need to do some text mining for research purposes.

Ideally I need the following output: company - year - report URL.

I'm trying to scrape it, but I can't match the URLs to the company and year data. Here's my script:

library(rvest)
library(dplyr)

url = "http://rspp.ru/tables/non-financial-reports-library/"

page = read_html(url)

# table
tab = page %>% 
  html_node("table") %>% 
  html_table(fill = T) 

# links
links = page %>% 
  html_node("table") %>% 
  html_nodes("a") %>% 
  html_attr("href")

Could you please help?


Solution

  • The table is irregular. One ugly way to handle it is to reconstruct the table using the colspan and rowspan attribute values (how many columns and rows a cell spans, respectively) and expand it into a regular data frame.

    You can then add the appropriate headers. To account for the merged cells, I simply repeat the same URL across the applicable years. I do grab the text description of the years covered by a given report, e.g. 2007-2009 (seen within cells with links), but do not output it, since the years are already in the header row.

    library(rvest)
    library(stringr)
    
    url <- 'http://rspp.ru/tables/non-financial-reports-library/'
    page <- read_html(url)
    headers <- page %>% html_nodes('.company-report-table .register-table__row:nth-child(1) th') %>% html_text()
    companies <- page %>% html_nodes('.company-report-table .register-table__row td:nth-child(1) span') %>% html_text()
    body_rows <- page %>% html_nodes('.register-table__row ~ .register-table__row')
    df <- data.frame(matrix(NA_character_, nrow = length(body_rows), ncol = length(headers)))
    n <- 0
    
    for(row in seq_along(body_rows)){
      curr_row <-  body_rows[[row]] 
      rspan <- curr_row %>% html_node('td') %>% html_attr('rowspan') %>% as.integer() #rspan tells us how many rows per company
      
      if(!is.na(rspan)){
        n <- n + 1
        title <- companies[[n]]
      }
      df[row,1] = title 
      # handle other columns excluding first
      columns_minus_first <- curr_row %>% html_nodes('td:not(:nth-child(1))') # not always 21 range 10 > 21 but we use colspan to expand to 21
      c <- 1
      
      for(column in seq_along(columns_minus_first)){
        curr_col <- columns_minus_first[[column]]
        cspan <- curr_col %>% html_attr('colspan') %>% as.integer() #use cspan value to determine how many years report covers
        
        if(!is.na(cspan)){
          link <- paste0('http://rspp.ru', curr_col %>% html_node('a') %>% html_attr('href'))
          year <- str_extract(curr_col %>% html_text() ,'\\b[0-9-]{4,9}\\b') #purists may want a tighter regex for year spans
          
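          for(i in seq_len(cspan)){ #seq_len, not seq_along: cspan is a single integer. We start writing from col 2 as first col is the company name
            df[row,i+c] <- link #repeats for each year covered by report (could alter this for only first)
          }
          c <- c + cspan #move past every year column this cell spans
        } else {
          c <- c + 1 #cell without a colspan occupies a single year column
        }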
      }
    }
    
    colnames(df) <- headers
    df <- tibble::as_tibble(df) #tibble is not attached by rvest or stringr, so call it via its namespace
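
The question asked for company - year - report URL rows, while the code above produces one column per year. A minimal base R sketch of that last reshaping step follows. It assumes the first column holds the company name, the remaining column names are plain four-digit years, and cells without a report are NA. The toy `wide` data frame, its company names and its example.com URLs are made up for illustration and stand in for the scraped `df`.

```r
# Toy stand-in for the wide data frame built above (hypothetical values).
wide <- data.frame(
  Company = c("Company A", "Company B"),
  `2007` = c("http://example.com/a-2007-2008", NA),
  `2008` = c("http://example.com/a-2007-2008", "http://example.com/b-2008"),
  check.names = FALSE,       # keep "2007"/"2008" as column names
  stringsAsFactors = FALSE
)

year_cols <- names(wide)[-1] # every column except the company name

# unlist() stacks the year columns one after another, so the company vector
# is repeated once per year column and each year once per company row.
long <- data.frame(
  company    = rep(wide[[1]], times = length(year_cols)),
  year       = rep(as.integer(year_cols), each = nrow(wide)),
  report_url = unlist(wide[year_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Drop company-year pairs with no report, then sort for readability.
long <- long[!is.na(long$report_url), ]
long <- long[order(long$company, long$year), ]
rownames(long) <- NULL

print(long)
```

A report that spans several years appears once per year here, because the wide table repeats its URL in each covered column. If only one row per report is wanted, something like `long[!duplicated(long[c("company", "report_url")]), ]` would keep the first year of each.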