Search code examples
rcsvdataframescopus

R how to import a list with different number of columns to a data frame


I am trying to perform some scientometrics analysis from a Scopus csv file. The first column of the imported csv is like:

Authors,Title,Year,Source title,Volume,Issue,Art. No.,Page start,Page end,Page count,Cited by,DOI,Link,Document Type,Source,EID
"Kuck, L.S., Noreña, C.P.Z.","Microencapsulation of grape (Vitis labrusca var. Bordo) skin phenolic extract using gum Arabic, polydextrose, and partially hydrolyzed guar gum as encapsulating agents",2016,"Food Chemistry","194",,,"569","576",,,10.1016/j.foodchem.2015.08.066,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84940212199&partnerID=40&md5=e4c36e03156570a7fe31c2937b3a170d",Article,Scopus,2-s2.0-84940212199
"Grasel, F.D.S., Ferrão, M.F., Wolf, C.R.","Development of methodology for identification the nature of the polyphenolic extracts by FTIR associated with multivariate analysis",2016,"Spectrochimica Acta - Part A: Molecular and Biomolecular Spectroscopy","153",,,"94","101",,,10.1016/j.saa.2015.08.020,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84939865445&partnerID=40&md5=8239487f4eea9479d698792e6aa348de",Article,Scopus,2-s2.0-84939865445
"De Souza, D., Sbardelotto, A.F., Ziegler, D.R., Marczak, L.D.F., Tessaro, I.C.","Characterization of rice starch and protein obtained by a fast alkaline extraction method",2016,"Food Chemistry","191",, 17279,"36","44",,,10.1016/j.foodchem.2015.03.032,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84938952690&partnerID=40&md5=989cbfcc72286a87f726925732db4b49",Article,Scopus,2-s2.0-84938952690
"Filho, P.R.M., Vercelino, R., Cioato, S.G., Medeiros, L.F., de Oliveira, C., Scarabelot, V.L., Souza, A., Rozisky, J.R., Quevedo, A.S., Adachi, L.N.S., Sanches, P.R.S., Fregni, F., Caumo, W., Torres, I.L.S.","Transcranial direct current stimulation (tDCS) reverts behavioral alterations and brainstem BDNF level increase induced by neuropathic pain model: Long-lasting effect",2016,"Progress in Neuro-Psychopharmacology and Biological Psychiatry","64",,,"44","51",,,10.1016/j.pnpbp.2015.06.016,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84937468588&partnerID=40&md5=b03f0ccfbf66a49a438c9983cc2e8f9d",Article,Scopus,2-s2.0-84937468588
"Duarte, Á.T., Borges, A.R., Zmozinski, A.V., Dessuy, M.B., Welz, B., De Andrade, J.B., Vale, M.G.R.","Determination of lead in biomass and products of the pyrolysis process by direct solid or liquid sample analysis using HR-CS GF AAS",2016,"Talanta","146",,,"166","174",,,10.1016/j.talanta.2015.08.041,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84940416990&partnerID=40&md5=55d7ddad27e955b9b6e269469e29c8c3",Article,Scopus,2-s2.0-84940416990
"Francischini, H., Paes Neto, V.D., Martinelli, A.G., Pereira, V.P., Marinho, T.S., Teixeira, V.P.A., Ferraz, M.L.F., Soares, M.B., Schultz, C.L.","Invertebrate traces in pseudo-coprolites from the upper Cretaceous Marília Formation (Bauru Group), Minas Gerais State, Brazil",2016,"Cretaceous Research","57",,,"29","39",,,10.1016/j.cretres.2015.07.016,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84939175950&partnerID=40&md5=b049de15a08ba477cc189d7e8fe7f0a3",Article,Scopus,2-s2.0-84939175950
"Bonfatti, B.R., Hartemink, A.E., Giasson, E., Tornquist, C.G., Adhikari, K.","Digital mapping of soil carbon in a viticultural region of Southern Brazil",2016,"Geoderma","261",,,"204","221",,,10.1016/j.geoderma.2015.07.016,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84939499978&partnerID=40&md5=b470166e01648dcbe8f0d43be86c84e0",Article,Scopus,2-s2.0-84939499978
"Scaramuzza dos Santos, T.C., Holanda, E.C., de Souza, V., Guerra-Sommer, M., Manfroi, J., Uhl, D., Jasper, A.","Evidence of palaeo-wildfire from the upper Lower Cretaceous (Serra do Tucano Formation, Aptian-Albian) of Roraima (North Brazil)",2016,"Cretaceous Research","57",,,"46","49",,,10.1016/j.cretres.2015.08.003,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84939615367&partnerID=40&md5=e59f5130c6a2e1863f9aa77c960e6462",Article,Scopus,2-s2.0-84939615367
"da Silva, S.W., Bortolozzi, J.P., Banús, E.D., Bernardes, A.M., Ulla, M.A.","TiO<inf>2</inf> thick films supported on stainless steel foams and their photoactivity in the nonylphenol ethoxylate mineralization",2016,"Chemical Engineering Journal","283",, 14049,"1264","1272",,,10.1016/j.cej.2015.08.057,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84940747062&partnerID=40&md5=aebc7357f9dedaadeebabfeda4aa3dd9",Article,Scopus,2-s2.0-84940747062
"Dalmora, A.C., Ramos, C.G., Oliveira, M.L.S., Teixeira, E.C., Kautzmann, R.M., Taffarel, S.R., de Brum, I.A.S., Silva, L.F.O.","Chemical characterization, nano-particle mineralogy and particle size distribution of basalt dust wastes",2016,"Science of the Total Environment","539",, 18331,"560","565",,,10.1016/j.scitotenv.2015.08.141,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84941754626&partnerID=40&md5=1c4ca1a3486ff55f92f238083af3eb50",Article,Scopus,2-s2.0-84941754626
"Fink, J.R., Inda, A.V., Bavaresco, J., Barrón, V., Torrent, J., Bayer, C.","Adsorption and desorption of phosphorus in subtropical soils as affected by management system and mineralogy",2016,"Soil and Tillage Research","155",,,"62","68",,,10.1016/j.still.2015.07.017,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84940195225&partnerID=40&md5=2e43a874f1e36f11aa5efa057ce660b9",Article,Scopus,2-s2.0-84940195225
"Martins, A.B., Santana, R.M.C.","Effect of carboxylic acids as compatibilizer agent on mechanical properties of thermoplastic starch and polypropylene blends",2016,"Carbohydrate Polymers","135",,,"79","85",,,10.1016/j.carbpol.2015.08.074,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84940781718&partnerID=40&md5=426e62c6c0de33a91bdb2f75442fbd6f",Article,Scopus,2-s2.0-84940781718

In each line, there are a variable number of authors (up to more than 20). Until now, I am doing something like:

test <- read.csv("test.csv")
test$Authors <- as.character(test$Authors)
test2 <- strsplit(as.character(test$Authors), '.,', fixed=TRUE)

Which gives me a list correctly separating each author. I tested several alternatives proposed in the list to move the list to a data frame, but the closer one was:

test3 <- str_split_fixed(test$Authors, '.,', n = 20)

Which gave me two problems:

1) I have to define the number of columns, which I do not know before analyzing the data; 2) the authors are not properly separated, but surname and abbreviated names are in different columns. Additionally, the command removed some characters from the names.

Some of the strategies suggested elsewhere gave me the correct separation of authors in columns, but the empty columns were fulfilled by repeating the initial names. Sorry if the question is too naive, I am beginning in the use of R.

Any advises and or insights would be greatly appreciated!


Solution

  • Here's how I would do it.

    Firstly, using read.csv is causing the split between Authors last name and initial, so I'm using readLines instead.

    Secondly, having "wide data" like this is not in general a good idea. It makes data more difficult to work with in subsequent analyses. For that reason, I've made it "long" instead.

    n1 <- readLines(con="test.csv")
    n1 <- strsplit(n1, '., ', fixed=TRUE)
    n1 <- do.call(rbind, lapply(1:length(n1), function(x){data.frame(aut = n1[[x]], pub = x, order = 1:length(n1[[x]]))}))
    n1$aut <- gsub("\\.$", "", n1$aut)
    

    Here's the output:

                   aut pub order
    1        Kuck, L.S   1     1
    2   Noreña, C.P.Z   1     2
    3    Grasel, F.D.S   2     1
    4     Ferrão, M.F   2     2
    5        Wolf, C.R   2     3
    6       Abreu, M.S   3     1
    7 Giacomini, A.C.V   3     2
    8         Gusso, D   3     3
    9      Rosa, J.G.S   3     4
    

    NB if you really want your data in "wide format", we can easily reshape it:

    library(tidyr)
    spread(n1, order, aut)
    
      pub             1                2         3           4
    1   1     Kuck, L.S   Noreña, C.P.Z      <NA>        <NA>
    2   2 Grasel, F.D.S     Ferrão, M.F Wolf, C.R        <NA>
    3   3    Abreu, M.S Giacomini, A.C.V  Gusso, D Rosa, J.G.S
    

    EDIT: for your full version, you need to use read.csv:

    input <- n1 <- read.csv("test.csv")
    n1$Authors <- as.character(n1$Authors)
    n1$Authors <- strsplit(n1$Authors, '., ', fixed=TRUE)
    n1 <- do.call(rbind, lapply(1:length(n1$Authors), function(x){data.frame(aut = n1$Authors[[x]], pub = x, order = 1:length(n1$Authors[[x]]))}))
    n1$aut <- gsub("\\.$", "", n1$aut)
    

    If you want to go back to wide with all your stuff:

    library(dplyr)
    library(tidyr)
    input <- mutate(input, row = row_number())
    n1 %>% spread(order, aut) %>%
           left_join(input, by = c("pub" = "row")) %>%
           select(-Authors)