Apologies, I am still getting acquainted with the world of dplyr and data.table, and trying to figure out their full capabilities!
I have a dataset that I would like to group on a specific variable (Locus):
DF <- structure(list(Gene = c("GeneA", "GeneB", "GeneC", "GeneD", "GeneE"),
Locus = c("1","2","2","3","3"),
Chromosome = c("1","1","1","1","1"),
Start = c("100","500","600","1000","1500"),
Stop = c("200","550","700","1400","1750")),
.Names = c("Gene","Locus","Chromosome","Start","Stop"),
row.names = c(NA, 5L),
class = "data.frame")
> DF
Gene Locus Chromosome Start Stop
GeneA 1 1 100 200
GeneB 2 1 500 550
GeneC 2 1 600 700
GeneD 3 1 1000 1400
GeneE 3 1 1500 1750
I was wondering whether it is possible to write out "per locus" files containing the values from the Gene, Chromosome, Start and Stop columns, but only for loci that have more than one row. So Locus==1 would have no text file written out, while the rows for Locus==2 and Locus==3 would each be written to their own file, e.g.
<loc2.txt>
Gene Chromosome Start Stop
GeneB 1 500 550
GeneC 1 600 700
<loc3.txt>
Gene Chromosome Start Stop
GeneD 1 1000 1400
GeneE 1 1500 1750
Thanks in advance for any help!
library(dplyr)
newDF <- DF %>%
group_by(Locus) %>%
filter(n() > 1) %>%
nest_by()
newDF
# # A tibble: 2 x 2
# # Rowwise: Locus
# Locus data
# <chr> <list<tbl_df[,4]>>
# 1 2 [2 x 4]
# 2 3 [2 x 4]
mapply(function(x, nm) write.csv(x, nm),
newDF$data, paste0("loc", newDF$Locus, ".csv"))
# [[1]]
# NULL
# [[2]]
# NULL
The files are created in the current working directory. You can safely ignore the NULL output from mapply (or wrap the call in invisible() to suppress it).
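The call above writes comma-separated files that include row names. If the tab-delimited .txt layout from the question is wanted instead, one possible variation is to swap in write.table. This is only a sketch: `out_dir`, `nested` and `txt_files` are names made up for the example, and `tempdir()` is used so the example does not clutter the working directory (use "." to write to the current directory).

```r
library(dplyr)

# Same example data as in the question
DF <- data.frame(
  Gene       = c("GeneA", "GeneB", "GeneC", "GeneD", "GeneE"),
  Locus      = c("1", "2", "2", "3", "3"),
  Chromosome = c("1", "1", "1", "1", "1"),
  Start      = c("100", "500", "600", "1000", "1500"),
  Stop       = c("200", "550", "700", "1400", "1750"),
  stringsAsFactors = FALSE
)

# Hypothetical output location; replace with "." for the current directory
out_dir <- tempdir()

# Keep only loci with more than one row, then nest one data frame per locus
nested <- DF %>%
  group_by(Locus) %>%
  filter(n() > 1) %>%
  nest_by()

# One file name per remaining locus, e.g. loc2.txt, loc3.txt
txt_files <- file.path(out_dir, paste0("loc", nested$Locus, ".txt"))

# Tab-separated, no quotes, no row names; invisible() hides the NULL list
invisible(mapply(
  function(x, nm) write.table(x, nm, sep = "\t", quote = FALSE, row.names = FALSE),
  nested$data, txt_files
))
```

Because nest_by() drops the grouping column from the nested data, each file contains only Gene, Chromosome, Start and Stop.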
library(data.table)
DT <- as.data.table(DF)
newDT <- DT[, .SD[.N > 1, .(data = list(.SD))], by = Locus]
newDT
# Locus data
# <char> <list>
# 1: 2 <data.table[2x4]>
# 2: 3 <data.table[2x4]>
mapply(function(x, nm) write.csv(x, nm),
newDT$data, paste0("loc", newDT$Locus, ".csv"))
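With data.table it should also be possible to skip the intermediate list column and write each group directly from j. The following is a sketch under the same assumptions as the question's data; `out_dir` is a hypothetical name, and `tempdir()` can be replaced with "." to write to the current directory.

```r
library(data.table)

# Same example data as in the question
DT <- data.table(
  Gene       = c("GeneA", "GeneB", "GeneC", "GeneD", "GeneE"),
  Locus      = c("1", "2", "2", "3", "3"),
  Chromosome = c("1", "1", "1", "1", "1"),
  Start      = c("100", "500", "600", "1000", "1500"),
  Stop       = c("200", "550", "700", "1400", "1750")
)

# Hypothetical output location; replace with "." for the current directory
out_dir <- tempdir()

# j runs once per Locus:
#   .N  is the number of rows in the current group
#   .BY holds the current value of the grouping column
#   .SD is the group's rows without the grouping column
# Groups with a single row are skipped because the if() has no else branch.
invisible(
  DT[, if (.N > 1L) {
         fwrite(.SD,
                file.path(out_dir, paste0("loc", .BY$Locus, ".txt")),
                sep = "\t", quote = FALSE)
       },
     by = Locus]
)
```

Here fwrite is called only for its side effect; the data.table returned by the grouped call is empty and can be discarded.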