
R - Replace observations with dummy if in top x% of var


I have some data in a large data frame (about 80x300) that looks something like this:

dum <- data.frame(id=c("a", "b", "c", "d", "e"),
                 v1=c(2, 7, 8, 5, 0),
                 v2=c(9, 2, 4, 6, 1),
                 v3=c(2, 2, 6, 1, 7))

I would like to turn each variable into a dichotomous variable indicating whether each observation is in the top 20% of that variable. (I'll later merge the dummy dataset with the raw dataset, which is unimportant for now, but if anyone wants to get creative, that's the full plan.) The output data frame should look something like this:

id     v1     v2     v3
a      0      1      0
b      0      0      0
c      1      0      0
d      0      0      0
e      0      0      1

My attempt at this looks like the following:

top <- 20  # set percentage
for(i in 2:ncol(dum)) {
  for(j in 1:nrow(dum)) {
    ifelse(dum[j,i]>=unname(quantile(dum[,i],probs=((100-top)/100))), dum[j,i]<-1, dum[j,i]<-0)
  }
}

However, when I run this, I get exactly the number of ones I want in some columns and more than desired in others. Instead of the output shown above, I get this:

id     v1     v2     v3
a      0      1      0
b      0      0      0
c      1      0      0
d      1      1      0
e      0      1      1

Can anyone help identify where I am going wrong? A few notes:

1) I am prepared to get hated on for using loops, especially nested loops, but they are something I'm familiar with, and computational time is not a concern here.
2) Based on my googling, it seems the apply family of functions could be useful, but I'm not very familiar with them, so I wouldn't know where to begin.
3) I included the unname() call as an attempted fix, but the code runs the same with or without it.
4) The yes/no part of the ifelse() statement looks funny to me, but when I tried ifelse(cond, 1, 0) it didn't make any changes to the data frame, and I didn't understand why.

Thanks!


Solution

  • You can use apply with ifelse to do this. See below:

    apply(dum[2:4],2,function(x) {ifelse(x>=quantile(x,.8),1,0)})
    

    This returns:

         v1 v2 v3
    [1,]  0  1  0
    [2,]  0  0  0
    [3,]  1  0  0
    [4,]  0  0  0
    [5,]  0  0  1
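
    For comparison, the nested loop in the question most likely misbehaves because quantile(dum[,i], ...) is re-evaluated inside the inner loop, after earlier rows of that column have already been overwritten with 0 or 1, so the cutoff drifts as the loop proceeds. A minimal sketch of a fix, assuming the same dum and top as in the question, computes the cutoff once per column before changing any values (cutoff is a name introduced here for illustration):

    ```r
    dum <- data.frame(id = c("a", "b", "c", "d", "e"),
                      v1 = c(2, 7, 8, 5, 0),
                      v2 = c(9, 2, 4, 6, 1),
                      v3 = c(2, 2, 6, 1, 7))

    top <- 20  # percentage, as in the question
    for (i in 2:ncol(dum)) {
      # Compute the cutoff once, from the still-untouched column
      cutoff <- quantile(dum[, i], probs = (100 - top) / 100)
      for (j in 1:nrow(dum)) {
        # Assign the result explicitly instead of assigning inside ifelse()
        dum[j, i] <- if (dum[j, i] >= cutoff) 1 else 0
      }
    }
    ```

    This also bears on note 4 in the question: ifelse(cond, 1, 0) on its own only returns a value, so nothing changes unless the result is assigned back to the data frame.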
    

    Note that I've used dum[2:4] to select the columns to test. With your complete dataset, change this to select only the columns you want to convert.
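
    On a wider data frame, one way to avoid hard-coding column positions is to pick the numeric columns programmatically. This is only a sketch: num_cols, dummies and top are names introduced here, and it assumes every numeric column should be dichotomised, which may not hold for your full dataset.

    ```r
    dum <- data.frame(id = c("a", "b", "c", "d", "e"),
                      v1 = c(2, 7, 8, 5, 0),
                      v2 = c(9, 2, 4, 6, 1),
                      v3 = c(2, 2, 6, 1, 7))

    top <- 20  # percentage of observations to flag in each column

    # Logical vector marking the numeric columns (assumption: all are wanted)
    num_cols <- sapply(dum, is.numeric)

    # One 0/1 column per numeric variable; the cutoff is taken per column
    dummies <- sapply(dum[num_cols], function(x) {
      as.integer(x >= quantile(x, probs = (100 - top) / 100))
    })
    ```

    The result is a matrix with one column per numeric variable, in the same row order as dum.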

    If you want to merge the data with the original, you can add:

    dum2 = cbind(dum,apply(dum[2:4],2,function(x) {ifelse(x>=quantile(x,.8),1,0)}))
    

    Which returns (note that the dummy columns reuse the names v1, v2 and v3):

      id v1 v2 v3 v1 v2 v3
    1  a  2  9  2  0  1  0
    2  b  7  2  2  0  0  0
    3  c  8  4  6  1  0  0
    4  d  5  6  1  0  0  0
    5  e  0  1  7  0  0  1
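
    Because the merged result above contains two columns each named v1, v2 and v3, it may be easier to work with if the dummy columns get their own names. A minimal sketch, where the "_top" suffix and the name dummies are illustrative choices rather than anything required:

    ```r
    dum <- data.frame(id = c("a", "b", "c", "d", "e"),
                      v1 = c(2, 7, 8, 5, 0),
                      v2 = c(9, 2, 4, 6, 1),
                      v3 = c(2, 2, 6, 1, 7))

    # Same 0/1 matrix as before
    dummies <- apply(dum[2:4], 2, function(x) ifelse(x >= quantile(x, .8), 1, 0))

    # Give the dummy columns distinct names before binding them on
    colnames(dummies) <- paste0(colnames(dummies), "_top")
    dum2 <- cbind(dum, dummies)
    ```

    With distinct names, dum2$v1 refers to the raw values and dum2$v1_top to the indicator.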