
Combining tables with different numbers of rows with a master MAP table


This dataset (gbsgre) holds genome map positions (chr and start) with the sequencing coverage (depth) at each position, summed over the 20 individuals in dat.

Example:

gbsgre <- "chr start end depth
chr1 3273 3273 7
chr1 3274 3274 3
chr1 3275 3275 8
chr1 3276 3276 4
chr1 3277 3277 25"
gbsgre <- read.table(text=gbsgre, header=TRUE)

Each of the following datasets holds genome map positions (chr plus start) with one individual's coverage (depth) at each position.

Example:

df1 <- "chr start depth
        chr1 3273 4
        chr1 3276 4
        chr1 3277 15"
df1 <- read.table(text=df1, header=TRUE)

df2 <- "chr start depth
        chr1 3273 3
        chr1 3274 3
        chr1 3275 8
        chr1 3277 10"

df2 <- read.table(text=df2, header=TRUE)

# collect the per-individual tables in a list (20 of them in the real data)
dat <- list(df1, df2)

> dat
[[1]]
   chr start depth
1 chr1  3273     4
2 chr1  3276     4
3 chr1  3277    15

[[2]]
   chr start depth
1 chr1  3273     3
2 chr1  3274     3
3 chr1  3275     8
4 chr1  3277    10

Using the chr and start positions in gbsgre, I need to cross the depths of all 20 animals ([[1]] to [[20]] in dat) against the main table (gbsgre) to generate a final table as follows: the first column is the chromosome (chr), the second is the start position (start), the third is the depth from the gbsgre dataset, the fourth is the depth from dat[[1]], and so on up to the twenty-third column, which is the depth from dat[[20]]. Importantly, data missing for any of the 20 individuals should be treated as zero (0), and the final table should have the same number of rows as gbsgre.

# Example result
> GBSMeDIP
    chr start depth depth1 depth2
1: chr1  3273     7      4      3
2: chr1  3274     3      0      3
3: chr1  3275     8      0      8
4: chr1  3276     4      4      0
5: chr1  3277    25     15     10

Solution

  • Using data.table:

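    library(data.table)
    # set names on your list `dat` first; rbindlist() uses them as the id values
    setattr(dat, 'names', paste0("depth", seq_along(dat)))
    # bind them by rows and reshape to wide form, filling missing positions with 0
    dcast(rbindlist(dat, idcol="id"), chr + start ~ id, value.var="depth", fill=0L)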
    #     chr start depth1 depth2
    # 1: chr1  3273      4      3
    # 2: chr1  3274      0      3
    # 3: chr1  3275      0      8
    # 4: chr1  3276      4      0
    # 5: chr1  3277     15     10
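
The reshaped table above has only the per-individual depths. The question also asks for the depth column from gbsgre, and for one row per gbsgre position. One possible way to get there is a left join of the wide table onto gbsgre, sketched below. The names `long`, `wide` and `depth_cols` are placeholders introduced here, and the input tables are rebuilt inline from the question's example so the block runs on its own. It assumes data.table 1.12.4 or later for `setnafill()`.

```r
library(data.table)

# Master table and per-individual tables, rebuilt from the question's example
gbsgre <- data.table(chr = "chr1", start = 3273:3277, end = 3273:3277,
                     depth = c(7L, 3L, 8L, 4L, 25L))
dat <- list(
  data.table(chr = "chr1", start = c(3273L, 3276L, 3277L),
             depth = c(4L, 4L, 15L)),
  data.table(chr = "chr1", start = c(3273L, 3274L, 3275L, 3277L),
             depth = c(3L, 3L, 8L, 10L))
)
names(dat) <- paste0("depth", seq_along(dat))
depth_cols <- names(dat)

# Stack the individuals, then spread them into one column each (0 where absent)
long <- rbindlist(dat, idcol = "id")
wide <- dcast(long, chr + start ~ id, value.var = "depth", fill = 0L)

# Left join onto the master table so every gbsgre position is kept
GBSMeDIP <- merge(gbsgre[, .(chr, start, depth)], wide,
                  by = c("chr", "start"), all.x = TRUE)

# Positions covered by no individual come back as NA after the join; make them 0
setnafill(GBSMeDIP, fill = 0L, cols = depth_cols)

# Put the individuals back in list order (depth1, depth2, ...)
setcolorder(GBSMeDIP, c("chr", "start", "depth", depth_cols))
GBSMeDIP
```

On the example data this should reproduce the table asked for in the question. The `setcolorder()` call matters more with all 20 individuals, because otherwise the depth columns can come out ordered as text (depth1, depth10, depth11, ...) rather than numerically.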