I have an overview page of student statistics, https://www.europa-uni.de/de/struktur/verwaltung/dezernat_1/statistiken/index.html , and each semester has its own page with the figures in an HTML table element, e.g. https://www.europa-uni.de/de/struktur/verwaltung/dezernat_1/statistiken/2013-Wintersemester/index.html
I would like to scrape all of this information and combine it into one data frame. I manually created a character vector of all URLs (perhaps there is a better way).
Edit: As was mentioned, some URL parts are capitalized and some are not. This list should be complete.
winters <- seq(from=2013, to=2021)
summers <- seq(from=2014, to=2022)
winters <- paste0(winters, "-wintersemester")
summers <- paste0(summers, "-Sommersemester")
all_terms <- c(rbind(winters, summers))
all_terms[1] <- "2013-Wintersemester"
all_terms[3] <- "2014-Wintersemester"
all_url <- paste0("https://www.europa-uni.de/de/struktur/verwaltung/dezernat_1/statistiken/", all_terms, "/index.html")
I can get the data for a single page:
library(rvest)  # provides read_html(), html_table() and the %>% pipe

all_url[1] %>%
  read_html() %>%
  html_table()
Studierende gesamt 6645
weiblich 4206
männlich 2439
Deutsche 5001
Ausländer/innen 1644
1. Fachsemester 1783
1. Hochschulsemester 1110
But I fail to write a for loop that does this for every page:
tables <- list()
index <- 1
for(i in length(all_url)){
  table <- all_url[i] %>%
    read_html() %>%
    html_table()
  tables[index] <- table
  index <- index + 1
}
df <- do.call("rbind", tables)
It would be great to have a dataframe with each sub-page (semester / year) as rows and all student data as columns.
Some pages appear not to be available. You could handle this with tryCatch and substitute NA for the pages that fail:
tables <- lapply(all_url, \(x) tryCatch(as.data.frame(html_table(read_html(x))),
error=\(e) NA)) |> setNames(all_terms)
tail(tables, 3)
# $`2021-Sommersemester`
# X1 X2
# 1 Studierende gesamt 5131
# 2 weiblich 3037
# 3 männlich 2054
# 4 Deutsche 3698
# 5 Ausländer/innen 1433
# 6 1. Fachsemester 394
# 7 1. Hochschulsemester 143
# $`2021-Wintersemester`
# [1] NA
# $`2022-Sommersemester`
# X1 X2
# 1 Studierende gesamt 4851
# 2 weiblich 2847
# 3 männlich 2004
# 4 Deutsche 3360
# 5 Ausländer/innen 1491
# 6 1. Fachsemester 403
# 7 1. Hochschulsemester 189
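The loop in the question fails for a separate reason: for(i in length(all_url)) visits only the last index, and tables[index] <- table uses single brackets where a list element needs double brackets. The same tryCatch idea written as an explicit loop might look like the sketch below. Note that fetch_table and the example.org URLs are hypothetical stand-ins (for as.data.frame(html_table(read_html(x))) and the real pages) so that the example runs offline.

```r
# Hypothetical stand-in for as.data.frame(html_table(read_html(x))):
# returns a small table, or throws an error for a "missing" page.
fetch_table <- function(url) {
  if (grepl("missing", url)) stop("HTTP error 404")
  data.frame(X1 = c("weiblich", "Deutsche"), X2 = c(10L, 12L))
}

# Made-up URLs; the second one simulates an unavailable page
urls <- c("https://example.org/a",
          "https://example.org/missing",
          "https://example.org/b")

tables <- vector("list", length(urls))
for (i in seq_along(urls)) {  # seq_along() yields 1, 2, 3; length() alone yields only 3
  tables[[i]] <- tryCatch(fetch_table(urls[i]),
                          error = function(e) NA)  # failed page becomes NA
}

ok <- !is.na(tables)               # FALSE only for the NA placeholder
res <- do.call(rbind, tables[ok])  # stack the tables that were fetched
```

Pre-allocating the list and indexing with i also makes the separate index counter unnecessary.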
Thereafter you may want to rbind the non-missings:
na <- is.na(tables)
tables[!na] <- Map(`[<-`, tables[!na], 'sem', value=substr(all_terms[!na], 1, 6)) ## add year column*
res <- do.call(rbind, tables[!is.na(tables)])
# X1 X2 sem
# 2013-Wintersemester.1 Studierende gesamt 6645 2013-W
# 2013-Wintersemester.2 weiblich 4206 2013-W
# 2013-Wintersemester.3 männlich 2439 2013-W
# 2013-Wintersemester.4 Deutsche 5001 2013-W
# 2013-Wintersemester.5 Ausländer/innen 1644 2013-W
# 2013-Wintersemester.6 1. Fachsemester 1783 2013-W
* Better use sapply(strsplit(substr(all_terms[!na], 1, 6), '-'), \(x) paste(rev(x), collapse='_')) here to get syntactically valid names, e.g. W_2013 instead of 2013-W.
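What the Map call does may not be obvious: for each table it calls the replacement function `[<-` with 'sem' as the index and one label as the value, which is the functional form of df['sem'] <- label. A toy sketch with made-up tables (tabs, labels and valid are illustrative names, not part of the code above):

```r
# Two made-up tables standing in for the scraped ones
tabs <- list(a = data.frame(X1 = c("weiblich", "Deutsche"), X2 = c(10L, 12L)),
             b = data.frame(X1 = c("weiblich", "Deutsche"), X2 = c(7L, 9L)))
labels <- c("2013-W", "2014-S")

# Same effect as: tabs$a['sem'] <- "2013-W"; tabs$b['sem'] <- "2014-S"
tabs <- Map(`[<-`, tabs, 'sem', value = labels)

# Footnote variant: swap the two parts so the label starts with a letter
valid <- sapply(strsplit(labels, '-'),
                function(x) paste(rev(x), collapse = '_'))
```

Each label is recycled over all rows of its table, so every row carries the semester it came from once the tables are stacked.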
Then reshape the data:
reshape2::dcast(res, X1 ~ sem, value.var='X2')
# X1 2013-W 2014-S 2014-W 2015-S 2016-S 2017-S 2018-S 2019-S 2020-S 2021-S 2022-S
# 1 1. Fachsemester 1783 567 1600 557 613 693 810 611 405 394 403
# 2 1. Hochschulsemester 1110 199 1020 224 240 217 273 214 78 143 189
# 3 Ausländer/innen 1644 1510 1649 1501 1576 1613 1551 1527 1369 1433 1491
# 4 Deutsche 5001 4836 4843 4599 4682 4733 4821 4523 4040 3698 3360
# 5 männlich 2439 2347 2394 2255 2292 2388 2468 2388 2197 2054 2004
# 6 Studierende gesamt 6645 6346 6492 6100 6258 6346 6372 6051 5409 5131 4851
# 7 weiblich 4206 3999 4098 3845 3966 3958 3904 3663 3212 3037 2847