Search code examples
rfor-loopweb-scrapingrvest

Scrape a list of webpages and put them toegether as dataframe


I have an overview page of student statistics https://www.europa-uni.de/de/struktur/verwaltung/dezernat_1/statistiken/index.html and each semester has specific information in a html table element, e.g. https://www.europa-uni.de/de/struktur/verwaltung/dezernat_1/statistiken/2013-Wintersemester/index.html

I would like to scrape all information and put it together as a dataframe. I manually created a char vector of all URLs (perhaps there is another way).

Edit As was mentioned, some URL parts are capitalized, some are not. This list should be complete.

winters <- seq(from=2013, to=2021)
summers <- seq(from=2014, to=2022)
winters <- paste0(winters, "-wintersemester")
summers <- paste0(summers, "-Sommersemester")

all_terms <- c(rbind(winters, summers))
all_terms[1] <- "2013-Wintersemester"
all_terms[3] <- "2014-Wintersemester"

all_url <- paste0("https://www.europa-uni.de/de/struktur/verwaltung/dezernat_1/statistiken/", all_terms, "/index.html")

I can get data for a single page

all_url[1] %>%
  read_html() %>% 
  html_table() %>% 
  as.data.frame()

Studierende gesamt  6645            
weiblich    4206            
männlich    2439            
Deutsche    5001            
Ausländer/innen 1644            
1. Fachsemester 1783            
1. Hochschulsemester    1110    

But fail to write a for loop.

tables <- list()
index <- 1

for(i in length(all_url)){
  table <- all_url[i] %>% 
    read_html() %>% 
    html_table()

  tables[index] <- table

  index <- index + 1
}

df <- do.call("rbind", tables)

It would be great to have a dataframe with each sub-page (semester / year) as rows and all student data as columns.


Solution

  • Some appear not to be available. You could solve this using tryCatch and substitute with NA.

    library(rvest)
    tables <- lapply(all_url, \(x) tryCatch(as.data.frame(html_table(read_html(x))),
                                            error=\(e) NA)) |> setNames(all_terms)
    
    tail(tables, 3)
    # $`2021-Sommersemester`
    #                     X1   X2
    # 1   Studierende gesamt 5131
    # 2             weiblich 3037
    # 3             männlich 2054
    # 4             Deutsche 3698
    # 5      Ausländer/innen 1433
    # 6      1. Fachsemester  394
    # 7 1. Hochschulsemester  143
    # 
    # $`2021-Wintersemester`
    # [1] NA
    # 
    # $`2022-Sommersemester`
    #                     X1   X2
    # 1   Studierende gesamt 4851
    # 2             weiblich 2847
    # 3             männlich 2004
    # 4             Deutsche 3360
    # 5      Ausländer/innen 1491
    # 6      1. Fachsemester  403
    # 7 1. Hochschulsemester  189
    

    Thereafter you may want to rbind the non-missings,

    na <- is.na(tables)
    tables[!na] <- Map(`[<-`, tables[!na], 'sem', value=substr(all_terms[!na], 1, 6))  ## add year column*
    res <- do.call(rbind, tables[!is.na(tables)])
    head(res)
    #                                       X1   X2    sem
    # 2013-Wintersemester.1 Studierende gesamt 6645 2013-W
    # 2013-Wintersemester.2           weiblich 4206 2013-W
    # 2013-Wintersemester.3           männlich 2439 2013-W
    # 2013-Wintersemester.4           Deutsche 5001 2013-W
    # 2013-Wintersemester.5    Ausländer/innen 1644 2013-W
    # 2013-Wintersemester.6    1. Fachsemester 1783 2013-W
    

    *better use sapply(strsplit(substr(all_terms[!na], 1, 6), '-'), \(x) paste(rev(x), collapse='_')) here to get valid names

    and reshape the data.

    reshape2::dcast(res, X1 ~ sem, value.var='X2')
    #                     X1 2013-W 2014-S 2014-W 2015-S 2016-S 2017-S 2018-S 2019-S 2020-S 2021-S 2022-S
    # 1      1. Fachsemester   1783    567   1600    557    613    693    810    611    405    394    403
    # 2 1. Hochschulsemester   1110    199   1020    224    240    217    273    214     78    143    189
    # 3      Ausländer/innen   1644   1510   1649   1501   1576   1613   1551   1527   1369   1433   1491
    # 4             Deutsche   5001   4836   4843   4599   4682   4733   4821   4523   4040   3698   3360
    # 5             männlich   2439   2347   2394   2255   2292   2388   2468   2388   2197   2054   2004
    # 6   Studierende gesamt   6645   6346   6492   6100   6258   6346   6372   6051   5409   5131   4851
    # 7             weiblich   4206   3999   4098   3845   3966   3958   3904   3663   3212   3037   2847