Writing out group_by with length > 1 to individual text files in R


Apologies, I am still getting acquainted with the world of dplyr and data.table, and trying to figure out their full capabilities!

I have a dataset that I would like to group on a specific variable (Locus):

DF <- structure(list(Gene = c("GeneA", "GeneB", "GeneC", "GeneD", "GeneE"), 
                Locus = c("1","2","2","3","3"),
                Chromosome = c("1","1","1","1","1"),
                Start = c("100","500","600","1000","1500"),
                Stop = c("200","550","700","1400","1750")),
                .Names = c("Gene","Locus","Chromosome","Start","Stop"), 
                row.names = c(NA, 5L), 
                class = "data.frame")

> DF
   Gene Locus Chromosome Start Stop
GeneA     1          1   100  200
GeneB     2          1   500  550
GeneC     2          1   600  700
GeneD     3          1  1000 1400
GeneE     3          1  1500 1750

I was wondering whether it is possible to write out "per locus" files containing the values from the Gene, Chromosome, Start and Stop columns, but only for loci that have more than one row. So no file would be written for Locus==1, but the rows for Locus==2 and Locus==3 would each be written to their own file, e.g.

<loc2.txt>
   Gene Chromosome Start Stop
GeneB           1   500  550
GeneC           1   600  700

<loc3.txt>
   Gene Chromosome Start Stop
GeneD           1  1000 1400
GeneE           1  1500 1750

Thanks in advance for any help!


Solution

  • dplyr

    library(dplyr)
    newDF <- DF %>%
      group_by(Locus) %>%
      filter(n() > 1) %>%
      nest_by()
    newDF
    # # A tibble: 2 x 2
    # # Rowwise:  Locus
    #   Locus               data
    #   <chr> <list<tbl_df[,4]>>
    # 1 2                [2 x 4]
    # 2 3                [2 x 4]
    mapply(function(x, nm) write.csv(x, nm),
           newDF$data, paste0("loc", newDF$Locus, ".csv"))
    # [[1]]
    # NULL
    # [[2]]
    # NULL
    

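    The files are written to the current working directory (as .csv files, since the code uses write.csv). The NULLs are just the return values of write.csv collected by mapply and can be safely ignored; wrapping the call in invisible() suppresses them.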
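    If you want the loc2.txt / loc3.txt names and the tab-separated layout from the question rather than CSV, the same filter-then-write idea can be sketched in base R. This is only a minimal sketch: out_dir is an assumed placeholder (tempdir() here) that you would replace with your own directory, and the tab separator is a guess at the desired format.

```r
# Example data from the question
DF <- data.frame(
  Gene       = c("GeneA", "GeneB", "GeneC", "GeneD", "GeneE"),
  Locus      = c("1", "2", "2", "3", "3"),
  Chromosome = c("1", "1", "1", "1", "1"),
  Start      = c("100", "500", "600", "1000", "1500"),
  Stop       = c("200", "550", "700", "1400", "1750"),
  stringsAsFactors = FALSE
)

out_dir <- tempdir()  # assumption: replace with your own output directory

# One data frame per locus, without the Locus column itself
by_locus <- split(DF[, setdiff(names(DF), "Locus")], DF$Locus)

# Keep only loci that have more than one row
by_locus <- Filter(function(d) nrow(d) > 1, by_locus)

# Write each remaining group to loc<Locus>.txt; invisible() hides the NULLs
invisible(Map(function(d, locus) {
  write.table(d, file.path(out_dir, paste0("loc", locus, ".txt")),
              sep = "\t", quote = FALSE, row.names = FALSE)
}, by_locus, names(by_locus)))
```

    Because the groups are filtered before anything is written, no file is created for Locus 1.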

  • data.table

    library(data.table)
    DT <- as.data.table(DF)
    newDT <- DT[, .SD[.N > 1, .(data = list(.SD))], by = Locus]
    newDT
    #     Locus              data
    #    <char>            <list>
    # 1:      2 <data.table[2x4]>
    # 2:      3 <data.table[2x4]>
    # Write each nested table to its own file (note: newDT, not the dplyr newDF)
    mapply(function(x, nm) write.csv(x, nm),
           newDT$data, paste0("loc", newDT$Locus, ".csv"))
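    With data.table, the intermediate list column can also be skipped by writing from inside the grouped j expression. The following is a sketch rather than part of the answer above: it assumes tab-separated .txt output is wanted, and out_dir is a hypothetical placeholder for your output directory.

```r
library(data.table)

# Example data from the question
DT <- data.table(
  Gene       = c("GeneA", "GeneB", "GeneC", "GeneD", "GeneE"),
  Locus      = c("1", "2", "2", "3", "3"),
  Chromosome = c("1", "1", "1", "1", "1"),
  Start      = c("100", "500", "600", "1000", "1500"),
  Stop       = c("200", "550", "700", "1400", "1750")
)

out_dir <- tempdir()  # assumption: replace with your own output directory

# For each Locus with more than one row, write .SD (every column except the
# grouping column) to loc<Locus>.txt. Single-row groups are skipped.
invisible(
  DT[, if (.N > 1) fwrite(.SD,
                          file.path(out_dir, paste0("loc", .BY$Locus, ".txt")),
                          sep = "\t"),
     by = Locus]
)
```

    Here .BY$Locus supplies the current group's value for the file name, and .SD already excludes Locus, so each file contains only Gene, Chromosome, Start and Stop.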