I am a bit struggling to code a script on R to process a dataset to get an input file for another program.
I have a dataset looking like this:
df1 <- read.table(text = "
chr pos ind0 ind1 ind2 ind3 ind4 ind5 ind6 ind7 ind8 ind9 ind10
MRVK01001299.1 972 C C T N C C T N N C C
MRVK01001299.1 973 G G G N G G G N N G G
MRVK01001299.1 997 C T T T T T T T T T T
MRVK01001299.1 999 A T T N T T T T T T T
MRVK01001299.1 1018 A C T N T C C T T T T
MRVK01001299.1 1086 A T T T T T T T T T T
MRVK01001299.1 2125 C C T N C C T N N C C
MRVK01001299.1 2456 G G G N G G G N N G G
", header = TRUE, stringsAsFactors = FALSE)
I would like to identify the position (pos) for which letters are found uniquely in ind0.
“N”s would not be counted as a different letters. So for example, we would have a unique value for position 997, 999 and 1086.
Then, I would like to count how many times ind0 has a private letters in series of 1000 for the position (pos) column. So this would be:
0 2
1000 1
2000 0
etc
Because we have two positions with a unique value for ind0 between 0 and 1000, 1 between 1000 and 2000, 0 between 2000 and 3000. The furthest value will be above 20,000,000.
I am struggling to find a solution to code this on R. Would someone be able to help?
Compare value of ind0 to other individuals, and subset:
res1 <- df1[ rowSums(df1$ind0 == df1[, -c(1:3)]) == 0 &
apply(df1[, -c(1:3)], 1, function(i) length(unique(i[ i != "N" ]))) == 1, ]
res1
# chr pos ind0 ind1 ind2 ind3 ind4 ind5 ind6 ind7 ind8 ind9 ind10
# 3 MRVK01001299.1 997 C T T T T T T T T T T
# 4 MRVK01001299.1 999 A T T N T T T T T T T
# 6 MRVK01001299.1 1086 A T T T T T T T T T T
Then we can get counts per chunks using table:
table(cut(res1$pos, c(0, 1000, 2000, 3000)))
# (0,1e+03] (1e+03,2e+03] (2e+03,3e+03]
# 2 1 0