Using R to aggregate like tapply, row-wise over a matrix


I have a problem with a matrix computation and would appreciate any pointers. Thank you very much in advance!

I have a data frame genderLocation and a matrix test; their rows correspond to each other by index.

genderLocation[,1:6]

           scanner_gender cmall_gender wechat_gender scanner_location cmall_location wechat_location
    156043              3            2             2        Guangzhou       Shenzhen        Shenzhen
    156044              2           NA            NA         Shenzhen           <NA>
    156045              2           NA             2         Shenzhen           <NA>        Hongkong
    156046              2           NA             2         Shenzhen           <NA>        Shenzhen

test

        [,1] [,2] [,3] [,4] [,5] [,6]
    [1,]  0.8  0.7  0.6  0.6  0.7  0.7
    [2,]  0.8  1.0  1.0  0.6  0.7  0.7
    [3,]  0.8  1.0  0.6  0.6  0.7  0.7
    [4,]  0.8  1.0  0.6  0.6  0.7  0.7

Now I want to aggregate row by row: for each row of genderLocation, group the entries of the corresponding row of test by the values in genderLocation and compute the mean of each group. Taking row 156043 as an example, the result should be

            2         3 Guangzhou  Shenzhen
         0.65      0.80      0.60      0.70

I don't know how to do this with the apply family (for-loops are generally discouraged in R). My attempt so far is

    > apply(test,1,function(tst,genderLoc) print(tapply(tst,as.character(genderLoc),mean)),genderLocation)

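but I cannot understand the results. Limited to the first 2 rows, the output seems more understandable (it comes from my real data, where the city names are in Chinese: 广州 is Guangzhou and 深圳 is Shenzhen).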

    > apply(test[1:2,],1,function(tst,genderLoc) print(tapply(tst,as.character(genderLoc),mean)),genderLocation[1:2,])
           c("2", NA)       c("3", "2") c("广州", "深圳")     c("深圳", "")     c("深圳", NA) 
                 0.65              0.80              0.60              0.70              0.70 
           c("2", NA)       c("3", "2") c("广州", "深圳")     c("深圳", "")     c("深圳", NA) 
                  1.0               0.8               0.6               0.7               0.7 
                      [,1] [,2]
    c("2", NA)        0.65  1.0
    c("3", "2")       0.80  0.8
    c("广州", "深圳") 0.60  0.6
    c("深圳", "")     0.70  0.7
    c("深圳", NA)     0.70  0.7

##### FYI

    test=matrix(c(0.8,0.8,0.8,0.8, 0.7,1,1,1, 0.6,1,0.6,0.6, 0.6,0.6,0.6,0.6, 0.7,0.7,0.7,0.7, 0.7,0.7,0.7,0.7),nrow=4,ncol=6,byrow=F)
    genderLocation<- data.frame(scanner_gender=c(3,2,2,2),cmall_gender=c(2,NA,NA,NA),wechat_gender=c(2,NA,2,2),
                                 scanner_location=c("Guangzhou","Shenzhen","Shenzhen","Shenzhen"),
                                 cmall_location=c("Shenzhen",NA,NA,NA),
                                 wechat_location=c("Shenzhen","","Hongkong","Shenzhen"))
    genderLocation1<-cbind(genderLocation,test)  # bound together because some apply functions accept only one input

Solution

  • The following works for your example data, but I don't know how robust it is on your full data. The final Reduce step may run into trouble if a row of df shares no key with the rows joined before it, since full_join joins on the columns the tables have in common. If you are happy to keep the output as a list, skip the Reduce step and this is not a concern. Keeping that in mind...

    --Your data--

    test <- matrix(c(0.8,0.8,0.8,0.8,0.7,1,1,1,0.6,1,0.6,0.6,0.6,0.6,0.6,0.6,rep(0.7,8)), nrow=4)
    
    df <- data.frame(scanner_gender=c(3,2,2,2),
                 cmall_gender=c(2,NA,NA,NA),
                 wechat_gender=c(2,NA,2,2),
                 scanner_location=c("Guangzhou","Shenzhen","Shenzhen","Shenzhen"),
                 cmall_location=c("Shenzhen",NA,NA,NA),
                 wechat_location=c("Shenzhen",NA,"Hongkong","Shenzhen"),
                 stringsAsFactors=FALSE)
    rownames(df) <- c(156043,156044,156045,156046)
    

    --Operation--

    I combine map from purrr with other tidyverse verbs to: 1) build, for each row, a two-column data frame with the df row's entries in column A and the matching test row's entries in column B; 2) drop the rows where A is NA; 3) compute the mean of B for each group of A; 4) spread the result into a one-row data frame with the keys in A as columns.

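    library(purrr); library(dplyr); library(tidyr)  # map; filter, group_by, summarise, full_join; spread

    # one single-row data frame of per-key means for each row of df
    L <- map(1:nrow(df),~data.frame(A=unlist(df[.x,]),B=test[.x,]) %>% 
                  filter(!is.na(A)) %>%
                  group_by(A) %>%
                  summarise(B=mean(B)) %>%
                  spread(A,B) )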
    

    I then reduce this list to a data frame using Reduce and full_join

    newdf <- Reduce("full_join", L)
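
If the join-based reduction is a concern, one possible alternative is to stack the per-row results with dplyr's bind_rows, which fills columns missing from a given row with NA and does not need any columns in common. The following is a minimal, self-contained sketch, assuming the same example data as above (with the blank wechat_location treated as NA). It also swaps spread for tidyr's pivot_wider, which is my substitution rather than part of the original approach.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Example data, repeated here so the sketch runs on its own
test <- matrix(c(0.8, 0.8, 0.8, 0.8,  0.7, 1, 1, 1,  0.6, 1, 0.6, 0.6,
                 0.6, 0.6, 0.6, 0.6,  rep(0.7, 8)), nrow = 4)
df <- data.frame(scanner_gender   = c(3, 2, 2, 2),
                 cmall_gender     = c(2, NA, NA, NA),
                 wechat_gender    = c(2, NA, 2, 2),
                 scanner_location = c("Guangzhou", "Shenzhen", "Shenzhen", "Shenzhen"),
                 cmall_location   = c("Shenzhen", NA, NA, NA),
                 wechat_location  = c("Shenzhen", NA, "Hongkong", "Shenzhen"),
                 stringsAsFactors = FALSE)

# One single-row tibble of per-key means for each row of df
L <- map(seq_len(nrow(df)), function(i) {
  data.frame(A = unlist(df[i, ]), B = test[i, ], stringsAsFactors = FALSE) %>%
    filter(!is.na(A)) %>%
    group_by(A) %>%
    summarise(B = mean(B)) %>%
    pivot_wider(names_from = A, values_from = B)
})

# Stack the rows; keys absent from a row become NA in that row
newdf <- bind_rows(L)
```

Because bind_rows only stacks, each row of df always yields exactly one row of newdf, regardless of which keys it shares with the others.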
    

    --Output--

        `2`   `3` Guangzhou Shenzhen Hongkong
    1  0.65   0.8       0.6     0.70       NA
    2  0.80    NA        NA     0.60       NA
    3  0.70    NA        NA     0.60      0.7
    4  0.70    NA        NA     0.65       NA
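
For comparison, the same result can be sketched in base R with the apply family alone. The attempt in the question passes the whole genderLocation to the function on every call, and as.character() applied to a data frame appears to collapse each column into one string, which would explain labels such as c("2", NA) in the question's output. The sketch below instead iterates over row indices with lapply, so that tapply sees one row of keys and one row of values at a time. It is a minimal sketch rather than a definitive implementation: the names row_means, all_keys and result are mine, and the blank wechat_location is assumed to be NA, as in the data above.

```r
# Example data, repeated here so the sketch runs on its own
test <- matrix(c(0.8, 0.8, 0.8, 0.8,  0.7, 1, 1, 1,  0.6, 1, 0.6, 0.6,
                 0.6, 0.6, 0.6, 0.6,  rep(0.7, 8)), nrow = 4)
df <- data.frame(scanner_gender   = c(3, 2, 2, 2),
                 cmall_gender     = c(2, NA, NA, NA),
                 wechat_gender    = c(2, NA, 2, 2),
                 scanner_location = c("Guangzhou", "Shenzhen", "Shenzhen", "Shenzhen"),
                 cmall_location   = c("Shenzhen", NA, NA, NA),
                 wechat_location  = c("Shenzhen", NA, "Hongkong", "Shenzhen"),
                 stringsAsFactors = FALSE)
rownames(df) <- c(156043, 156044, 156045, 156046)

# For each row index, group that row of test by that row of df
row_means <- lapply(seq_len(nrow(df)), function(i) {
  keys <- unlist(df[i, ])            # mixed columns are coerced to character
  m <- tapply(test[i, ], keys, mean) # entries with an NA key are dropped
  setNames(as.vector(m), names(m))   # plain named numeric vector
})
names(row_means) <- rownames(df)

# Optional: line the per-row results up in one matrix, NA where a key is absent
all_keys <- unique(unlist(lapply(row_means, names)))
result <- t(vapply(row_means,
                   function(m) unname(m[all_keys]),
                   numeric(length(all_keys))))
colnames(result) <- all_keys
```

Here row_means on its own is the list form of the result, one named vector per row of df, and result is the rectangular form with the row ids as row names.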