R: Nested for loop using indices won't overwrite data frame


I have the following data frame:

ID <- 1:5  # patient ID
snp1 <- c("A", "T", "A", "A", "T")
snp2 <- c("C", "C", "0", "C", "C")
snp3 <- c("A", "G", "A", "A", "G")
snp4 <- c("T", "0", "C", "G", "T")
snp5 <- c("G", "G", "G", "G", "A")
dat <- data.frame(ID, snp1, snp2, snp3, snp4, snp5)
print(dat)

which gives:

  ID snp1 snp2 snp3 snp4 snp5
1  1    A    C    A    T    G
2  2    T    C    G    0    G
3  3    A    0    A    C    G
4  4    A    C    A    G    G
5  5    T    C    G    T    A

I am trying to use a nested for loop to count how often each possible value occurs in each column of dat. To start, I create an empty data frame whose columns are snp1 through snp5 and whose rows are the possible values a column in dat can take:

results<- data.frame(matrix(0,ncol = 5, nrow = 5))
colnames(results)=c("snp1","snp2","snp3","snp4","snp5")
rownames(results)=c("A","T","C","G","0")

To make sure the code I want to incorporate in my loop works, I do the following:

results["A","snp1"]<-nrow(subset(dat,subset= snp1=="A"))
print(results)

which correctly gives 3, because snp1 in dat contains A three times:

  snp1 snp2 snp3 snp4 snp5
A    3    0    0    0    0
T    0    0    0    0    0
C    0    0    0    0    0
G    0    0    0    0    0
0    0    0    0    0    0

I then use the following nested for loop to do the same for each column (first for loop) but repeat the process for each of the possible values a column in dat can take (second for loop):

for(i in colnames(results)){for(j in c("A","T","C","G","0")){
            snp<-as.name(i)
            results[j,i]=nrow(subset(dat,subset= snp==j))
            results
          }}
print(results)

which gives a data frame completely filled with 0's:

  snp1 snp2 snp3 snp4 snp5
A    0    0    0    0    0
T    0    0    0    0    0
C    0    0    0    0    0
G    0    0    0    0    0
0    0    0    0    0    0

I've spent hours online trying to determine what the problem is but am at a loss for an explanation. I was originally hoping to run this process separately by the value of a phenotype column added to dat, so that I get counts for cases and controls, but I cannot get past this point. Any suggestions would be greatly appreciated. Thank you.


Solution

  • Write a function that does the right thing for one column, e.g.,

    fun = function(x)
        table(factor(x, levels = c("A", "C", "G", "T", "0")))
    

    then apply it to every column except ID

    apply(dat[,-1], 2, fun)
    

    It is probably much better to use NA rather than "0" to represent missing values; in that case, adjust the function as follows

    fun = function(x)
        table(factor(x, levels = c("A", "C", "G", "T")), useNA = "always")
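As for why the original loop fills results with zeros: as far as I can tell, as.name(i) creates a symbol, and subset() then evaluates snp == j with snp bound to that symbol rather than to a column of dat. R deparses a symbol to a string before comparing it, so the condition is a single FALSE and no rows are kept. Below is a minimal sketch of the loop with the column looked up by name via dat[[i]] instead; the names alleles and snps are introduced here only for illustration and are not from the question.

```r
# Rebuild the example data from the question.
dat <- data.frame(
  ID   = 1:5,
  snp1 = c("A", "T", "A", "A", "T"),
  snp2 = c("C", "C", "0", "C", "C"),
  snp3 = c("A", "G", "A", "A", "G"),
  snp4 = c("T", "0", "C", "G", "T"),
  snp5 = c("G", "G", "G", "G", "A")
)

alleles <- c("A", "T", "C", "G", "0")  # illustrative name
snps    <- paste0("snp", 1:5)          # illustrative name

# The comparison the original loop effectively performs:
# the symbol is deparsed to "snp1", which is never equal to "A".
as.name("snp1") == "A"

# Empty results table, rows = possible values, columns = SNPs.
results <- data.frame(matrix(0, nrow = length(alleles), ncol = length(snps),
                             dimnames = list(alleles, snps)))

for (i in snps) {
  for (j in alleles) {
    # dat[[i]] extracts the column whose name is stored in i,
    # so the comparison is made against the actual data.
    results[j, i] <- sum(dat[[i]] == j)
  }
}
print(results)
```

The table() approach above is still shorter and is what I would use in practice; the loop is shown only to make the cause of the zeros concrete. For counts split by cases and controls, passing the phenotype vector as a second argument to table() should give one row of counts per group.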