
Pull the number of citations for a list of DOIs from Google Scholar using RStudio


I'm working on a small program in RStudio that should pull the number of citations for a list of DOIs of specific scientific papers from Google Scholar. So far my code looks like this (I use a vector of test DOIs here; my real vector contains about 450 DOIs).

library(tibble)
library(dplyr)
library(rvest)
library(purrr)
library(xml2)
library(XML)
library(gsubfn)
library(proto)
library(readxl)

test.doi <- c("10.1111/j.1749-5687.2011.00133.x", "10.2307/20159610", "10.1111/j.1467-954X.2001.tb03531.x")

html_test.doi.list <- list()

for (i in test.doi){
  urli <- paste0("https://scholar.google.de/scholar?hl=de&as_sdt=0%2C5&q=", i, "&btnG=")
  html_test.doi.list[[i]] <- read_html(urli)
}

citnum <- html_test.doi.list %>%
  map(.f=function(x){
    html_nodes(x, xpath='/html/body/div/div[11]/div[2]/div[2]/div[2]/div[1]/div/div[3]/a[3]') %>%
      html_text()
  })

citnum2 <- html_test.doi.list %>%
  map(.f=function(x){
    html_nodes(x, xpath='/html/body/div/div[11]/div[2]/div[2]/div[2]/div[1]/div/div[2]/a[3]') %>%
      html_text()
  })


citnum <- replace(citnum, citnum=="character(0)", 99999)
citnum2 <- replace(citnum2, citnum2=="character(0)", 99999)

citnumclear <- gsub("\\D","",citnum)
citnum2clear <- gsub("\\D","",citnum2)

cit.table <- cbind(test.doi, citnumclear, citnum2clear)
View(cit.table) 

The main problem is extracting the right part of the HTML, because the citation count does not always appear in the same place. I am trying to work around this by using several XPaths to increase the chance of getting the information (citnum and citnum2 in my example), but I don't think this is the best way. Does anyone have a better idea?


Solution

  • I made a few changes to your 'citnum <- ...' block that seem to do the job.

    citnum <- html_test.doi.list %>%
      map(.f = function(x) {
        html_nodes(x, "a") %>%
          html_text() %>%
          .[grep("Zitiert von:", .)] %>%
          gsub("Zitiert von: ", "", .) %>%
          as.numeric() %>%
          .[1] # keep the citation count of the first result only
      })
    

    The idea is not to rely on an exact XPath or CSS selector, but to use the recurring string "Zitiert von:" (the German "Cited by:" label, shown because the URL requests hl=de), which appears next to each result's citation count. The code above first selects all links on the results page. grep() then keeps only the links whose text contains "Zitiert von:". After that, the label is stripped, the remainder is converted to a numeric value, and only the first entry is kept. That last step might not be what you are looking for; change it to your liking.
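
The text-matching step can also be tried without contacting Google Scholar. The following is only a minimal sketch in base R: the helper name extract_citation_count and the sample link texts are made up for illustration and are not part of the original answer. It assumes the link texts have already been obtained, for example with html_nodes(x, "a") %>% html_text() as above, and it returns NA rather than a sentinel such as 99999 when no citation link is found.

```r
# Hypothetical helper (not from the original answer): return the citation
# count of the first "Zitiert von:" link among the link texts of one page.
extract_citation_count <- function(link_texts, label = "Zitiert von:") {
  # Keep only the link texts that contain the label.
  hits <- grep(label, link_texts, fixed = TRUE, value = TRUE)
  # No matching link: report a missing value instead of a sentinel number.
  if (length(hits) == 0) {
    return(NA_integer_)
  }
  # Drop every non-digit character from the first hit, then convert.
  as.integer(gsub("\\D", "", hits[1]))
}

# Made-up link texts standing in for the output of html_text().
page_with_count <- c("Speichern", "Zitiert von: 152", "Alle 3 Versionen")
page_without_count <- c("Speichern", "Alle 3 Versionen")

# One integer per page, named after (hypothetical) DOI placeholders.
counts <- vapply(
  list(doi_a = page_with_count, doi_b = page_without_count),
  extract_citation_count,
  integer(1)
)
print(counts)
```

Because the label is matched before the digits are extracted, other links that happen to contain numbers (such as "Alle 3 Versionen") are ignored. With NA for missing counts, the results can be filtered later with is.na() instead of comparing against a magic number.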