Tags: r, csv, dataframe, append, write.table

Is there an efficient way to append to an existing csv file without duplicates in R?


A data.frame is appended to an existing file. When it is appended with the write.table function, duplicate records can end up in the file. Here is the sample code:

df1 <- data.frame(name=c('a','b','c'), a=c(1,2,2))
write.csv(df1, "export.csv", row.names=FALSE, na="NA")

# after this append, "export.csv" contains two copies of df1
write.table(df1, "export.csv", row.names=FALSE, na="NA", append=TRUE, quote=FALSE, sep=",", col.names=FALSE)

Ideally the output file should keep only one copy of df1, but the write.table function doesn't have a parameter for duplicate checking.

Thank you in advance for any suggestions.


Solution

  • You could read the existing data from the file, rbind it with the new data.frame, and check for duplicates. For writing efficiency, append only the non-duplicate rows.

    If you came up with this question because you are working with big data sets and read/write time is a concern, take a look at the data.table package and its fread/fwrite functions (a short sketch follows the example below).

    # initial data.frame
    df1 <- data.frame(name=c('a','b','c'), a=c(1,2,2))
    write.csv(df1, "export.csv", row.names=FALSE, na="NA")

    # a new data.frame with a couple of duplicate rows
    df2 <- data.frame(name=c('a','b','c'), a=c(1,2,3))
    dfRead <- read.csv("export.csv")  # read the existing file
    all <- rbind(dfRead, df2)         # rbind both data.frames
    # keep only the rows that come from df2 and are not duplicates
    nonDuplicate <- all[!duplicated(all) & c(rep(FALSE, nrow(dfRead)), rep(TRUE, nrow(df2))), ]
    # append the non-duplicate rows to the file
    write.table(nonDuplicate, "export.csv", row.names=FALSE, na="NA", append=TRUE, quote=FALSE, sep=",", col.names=FALSE)
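
    Since reading and writing is the bottleneck for large files, here is a minimal sketch of the same read / deduplicate / append idea using data.table's fread and fwrite. It assumes the "export.csv" file and the df2 data.frame from the example above; the intermediate names (existing, newRows, isNew, toAppend) are illustrative only.

    library(data.table)

    existing <- fread("export.csv")        # fast read of the current file
    newRows  <- as.data.table(df2)         # new batch of rows to append
    combined <- rbind(existing, newRows)   # stack old and new rows
    # flag positions that belong to the new batch and are not duplicates
    isNew    <- c(rep(FALSE, nrow(existing)), rep(TRUE, nrow(newRows)))
    toAppend <- combined[!duplicated(combined) & isNew]
    # fwrite with append = TRUE adds the rows without repeating the header
    fwrite(toAppend, "export.csv", append = TRUE, col.names = FALSE)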