Extracting strings of irregular length: unknown number of authors from citations


I have downloaded 500 article citations from Web of Science as a text file. Only the authors column (AU) has been read into R. The variable contains Author1 to AuthorN separated by semicolons:

Anselin, L; Fujita, M; Thisse, JF

I would like to extract Author1, Author2, Author3...AuthorN into separate columns. My file has up to 10 authors per record; this sample has at most 7:

    # Sample of data
    data <- c("Anselin, L; Varga, A; Acs, Z",
              "Acs, ZJ; Anselin, L; Varga, A",
              "Anselin, L",
              "Fujita, M; Thisse, JF",
              "Turner, RK; van den Bergh, JCJM; Soderqvist, T; Barendregt, A; van der Straaten, J; Maltby, E; van Ierland, EC",
              "Talen, E; Anselin, L",
              "Irwin, EG; Bockstael, NE",
              "Leggett, CG; Bockstael, NE",
              "Guimaraes, P; Figueiredo, O; Woodward, D",
              "Halpern, Benjamin S.; McLeod, Karen L.; Rosenberg, Andrew A.; Crowder, Larry B.")

I have tried many avenues:

    # Method3 - Read table: Not same amount of elements
    Meth3 <- read.table(textConnection(data), sep=";", stringsAsFactors=FALSE)

    # Method2 - Separate in different column: repeats the Names
    Meth2 <- do.call(rbind,
                     strsplit(gsub(";",
                                   "\\1NONSENSESPLIT\\2NONSENSESPLIT\\3", data),
                              "NONSENSESPLIT"))

    # Method5 - Split row entries, make an identifier and recombine them later: Struggle to recombine
    Meth5 <- strsplit(data, ";")
    i <- 0
    id <- unlist(sapply(Meth5, function(r) rep(i <<- i + 1, length(r))))
    x <- unlist(Meth5, recursive = FALSE)

    x <- list(do.call(rbind,
                      strsplit(gsub(";",
                                    "\\1NONSENSESPLIT\\2NONSENSESPLIT\\3", x),
                               "NONSENSESPLIT")))
    require(data.table)
    data.table(ID = id, do.call(rbind, x))

    # Method6: Identifies first Author:
    Meth6 <- gsub("[^a-zA-Z0-9 ]", "", strsplit(data, "\\; ")[[1]][[1]])

Any suggestions for organizing and identifying Author1...AuthorN are warmly welcomed.


Solution

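  • read.csv handles this directly. Its default fill=TRUE pads rows that have fewer fields with blanks, so only the separator needs to be set: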

    read.csv(text=data,header=FALSE,sep=";")
                         V1                   V2                    V3                 V4                   V5         V6               V7
    1            Anselin, L             Varga, A                Acs, Z                                                                    
    2               Acs, ZJ           Anselin, L              Varga, A                                                                    
    3            Anselin, L                                                                                                               
    4             Fujita, M           Thisse, JF                                                                                          
    5            Turner, RK  van den Bergh, JCJM         Soderqvist, T      Barendregt, A  van der Straaten, J  Maltby, E  van Ierland, EC
    6              Talen, E           Anselin, L                                                                                          
    7             Irwin, EG        Bockstael, NE                                                                                          
    8           Leggett, CG        Bockstael, NE                                                                                          
    9          Guimaraes, P        Figueiredo, O           Woodward, D                                                                    
    10 Halpern, Benjamin S.     McLeod, Karen L.  Rosenberg, Andrew A.  Crowder, Larry B.
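
A slightly more defensive variant of the same call, offered as a sketch rather than as part of the original answer. read.table (which read.csv wraps) decides the number of columns from the first five lines of input unless col.names is longer, so a record with more authors further down a 500-row file could otherwise be wrapped onto extra rows. The column names Author1...AuthorN and the variable names n_max and authors are assumptions made for illustration; strip.white and na.strings are standard read.csv arguments used here to drop the space after each semicolon and to mark missing authors as NA.

```r
# Subset of the sample data from the question
data <- c("Anselin, L; Varga, A; Acs, Z",
          "Anselin, L",
          "Fujita, M; Thisse, JF",
          "Turner, RK; van den Bergh, JCJM; Soderqvist, T; Barendregt, A; van der Straaten, J; Maltby, E; van Ierland, EC")

# Count the fields in every record so the column count does not depend on
# read.table's inspection of only the first five lines.
n_max <- max(lengths(strsplit(data, ";", fixed = TRUE)))

authors <- read.csv(text = data, header = FALSE, sep = ";",
                    col.names = paste0("Author", seq_len(n_max)),  # hypothetical names
                    strip.white = TRUE,        # remove the space after each ";"
                    na.strings = "",           # empty (padded) fields become NA
                    stringsAsFactors = FALSE)

# Number of authors per record, derived from the non-missing cells
n_authors <- rowSums(!is.na(authors))
```

With NA in the padded cells, the first author is simply authors$Author1, and n_authors gives the author count for each citation.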