How can I extract the names of all package authors from CRAN?


To celebrate the 100,000th question in the r tag, I'd like to create a list of the names of all package authors on CRAN.

Initially, I thought I could do this using available.packages(), but sadly its result doesn't contain an author column.

pdb <- available.packages()
colnames(pdb)

 [1] "Package"               "Version"               "Priority"             
 [4] "Depends"               "Imports"               "LinkingTo"            
 [7] "Suggests"              "Enhances"              "License"              
[10] "License_is_FOSS"       "License_restricts_use" "OS_type"              
[13] "Archs"                 "MD5sum"                "NeedsCompilation"     
[16] "File"                  "Repository"   

This information is available in the DESCRIPTION file for each package. So I can think of two brute-force ways, neither of which is very elegant:

  1. Download each of the 6,878 packages and read the DESCRIPTION file using base::read.dcf()

  2. Scrape each of the package pages on CRAN. For example, https://cran.r-project.org/web/packages/MASS/index.html tells me that Brian Ripley is the author of MASS.

I don't want to download all of CRAN to answer this question. And I don't want to scrape the HTML either, since the information in the DESCRIPTION file is a neatly formatted list of person objects (see ?person).
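For reference, this is roughly what reading the author fields from a single DESCRIPTION file looks like. It is only a sketch: it uses the stats package because that ships with every R installation, so nothing has to be downloaded, and the variable names desc_path and desc are my own.

```r
# Locate the DESCRIPTION file of a package that is already installed locally.
desc_path <- system.file("DESCRIPTION", package = "stats")

# read.dcf() returns a character matrix with one column per requested field;
# a field that is absent from the file comes back as NA.
desc <- read.dcf(desc_path, fields = c("Package", "Author", "Authors@R"))

desc[, "Package"]
desc[, "Author"]
desc[, "Authors@R"]
```

Doing this for every package on CRAN is exactly the brute-force download I would like to avoid.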

How can I use the information on CRAN to easily build a list of package authors?


Solution

  • Taken from reverse_dependencies_with_maintainers, which was available at one point on the R developer site (I don't see it there now):

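      # Location of the package metadata database on the configured CRAN mirror
      description <- sprintf("%s/web/packages/packages.rds",
                             getOption("repos")["CRAN"])
      # Open a binary connection: file() for a local mirror, url() otherwise
      con <- if (substring(description, 1L, 7L) == "file://") {
        file(description, "rb")
      } else {
        url(description, "rb")
      }
      # Wrap the connection in gzcon() so readRDS() can decompress the stream
      db <- as.data.frame(readRDS(gzcon(con)), stringsAsFactors = FALSE)
      close(con)
      rownames(db) <- NULL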
    
      head(db$Author)
      head(db$"Authors@R")
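One thing to watch for: in a session where no mirror has been chosen yet, getOption("repos")["CRAN"] is the placeholder "@CRAN@", and the URL built above will not resolve. A minimal sketch of a guard follows. The helper name cran_db_url and the choice of https://cloud.r-project.org as the fallback mirror are my own assumptions, not part of the original script.

```r
# Hypothetical helper: build the packages.rds URL, falling back to a default
# mirror when no usable CRAN repository is configured.
cran_db_url <- function(repos = getOption("repos"),
                        fallback = "https://cloud.r-project.org") {
  cran <- unname(repos["CRAN"])
  # Covers: repos option unset, no "CRAN" entry, or the "@CRAN@" placeholder
  if (length(cran) == 0L || is.na(cran) || identical(cran, "@CRAN@")) {
    cran <- fallback
  }
  # Drop any trailing slash so the path is not doubled up
  sprintf("%s/web/packages/packages.rds", sub("/+$", "", cran))
}

cran_db_url(c(CRAN = "@CRAN@"))
cran_db_url(c(CRAN = "https://cran.r-project.org/"))
```

The returned string can be used in place of description in the code above.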
    

    Where Authors@R is present, it holds R code that dget() can turn into person objects:

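    # Parse one Authors@R string into a person object; NA if the field is missing
    getAuthor <- function(x){
      if(is.na(x)) return(NA)
      a <- textConnection(x)
      on.exit(close(a))
      dget(a)  # evaluates the R code held in the string
    }
    authors <- lapply(db$"Authors@R", getAuthor)
    head(authors)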
    
    [[1]]
    [1] NA
    
    [[2]]
    [1] "Gaurav Sood <gsood07@gmail.com> [aut, cre]"
    
    [[3]]
    [1] "Csillery Katalin <kati.csillery@gmail.com> [aut]"
    [2] "Lemaire Louisiane [aut]"                         
    [3] "Francois Olivier [aut]"                          
    [4] "Blum Michael <michael.blum@imag.fr> [aut, cre]"  
    
    [[4]]
    [1] NA
    
    [[5]]
    [1] "Csillery Katalin <kati.csillery@gmail.com> [aut]"
    [2] "Lemaire Louisiane [aut]"                         
    [3] "Francois Olivier [aut]"                          
    [4] "Blum Michael <michael.blum@imag.fr> [aut, cre]"  
    
    [[6]]
    [1] NA
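    To get from this list to a plain vector of names, something along these lines might work. It is only a sketch on made-up input: the authors_at_r strings below are hand-written stand-ins for db$"Authors@R" (modelled on the output above, with emails omitted), and parse_authors is a hypothetical helper. Like dget(), it evaluates the code in each string, so it should only be run on input you trust.

```r
# Hand-written stand-ins for a few db$"Authors@R" entries (assumed input)
authors_at_r <- c(
  NA,
  'person("Gaurav", "Sood", role = c("aut", "cre"))',
  'c(person("Csillery", "Katalin", role = "aut"),
     person("Blum", "Michael", role = c("aut", "cre")))'
)

# Hypothetical helper: evaluate one Authors@R string into a person object,
# returning NULL when the field is missing
parse_authors <- function(x) {
  if (is.na(x)) return(NULL)
  eval(parse(text = x))
}

people <- lapply(authors_at_r, parse_authors)

# Keep only given and family names, dropping email, role and comment
name_list <- lapply(people, function(p) {
  if (is.null(p)) character(0) else format(p, include = c("given", "family"))
})

# One de-duplicated, sorted vector of author names
all_names <- sort(unique(unlist(name_list)))
all_names
```

    Packages without Authors@R would still need their free-text db$Author field handled separately.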