Search code examples
rbioinformaticsgenetics

Find letters in a column which are unique in comparison to several other columns and count them in segments


I am a bit struggling to code a script on R to process a dataset to get an input file for another program.

I have a dataset looking like this:

df1 <- read.table(text = "
chr  pos ind0 ind1 ind2 ind3 ind4 ind5 ind6 ind7 ind8 ind9 ind10
MRVK01001299.1 972    C    C    T    N    C    C    T    N    N    C     C
MRVK01001299.1 973    G    G    G    N    G    G    G    N    N    G     G
MRVK01001299.1 997    C    T    T    T    T    T    T    T    T    T     T
MRVK01001299.1 999    A    T    T    N    T    T    T    T    T    T     T
MRVK01001299.1 1018   A    C    T    N    T    C    C    T    T    T     T
MRVK01001299.1 1086   A    T    T    T    T    T    T    T    T    T     T
MRVK01001299.1 2125   C    C    T    N    C    C    T    N    N    C     C
MRVK01001299.1 2456   G    G    G    N    G    G    G    N    N    G     G
", header = TRUE, stringsAsFactors = FALSE)

I would like to identify the position (pos) for which letters are found uniquely in ind0.

“N”s would not be counted as a different letters. So for example, we would have a unique value for position 997, 999 and 1086.

Then, I would like to count how many times ind0 has a private letters in series of 1000 for the position (pos) column. So this would be:

0 2 
1000 1
2000 0
etc

Because we have two positions with a unique value for ind0 between 0 and 1000, 1 between 1000 and 2000, 0 between 2000 and 3000. The furthest value will be above 20,000,000.

I am struggling to find a solution to code this on R. Would someone be able to help?


Solution

  • Compare value of ind0 to other individuals, and subset:

    res1 <- df1[ rowSums(df1$ind0 == df1[, -c(1:3)]) == 0 &
                           apply(df1[, -c(1:3)], 1, function(i) length(unique(i[ i != "N" ]))) == 1, ]
    
    res1
    #              chr  pos ind0 ind1 ind2 ind3 ind4 ind5 ind6 ind7 ind8 ind9 ind10
    # 3 MRVK01001299.1  997    C    T    T    T    T    T    T    T    T    T     T
    # 4 MRVK01001299.1  999    A    T    T    N    T    T    T    T    T    T     T
    # 6 MRVK01001299.1 1086    A    T    T    T    T    T    T    T    T    T     T
    

    Then we can get counts per chunks using table:

    table(cut(res1$pos, c(0, 1000, 2000, 3000)))
    # (0,1e+03] (1e+03,2e+03] (2e+03,3e+03] 
    #         2             1             0