I would like to get a table of top 10 absolute and relative frequencies for a variable across other factor variable. I have a dataframe with 3 columns: 1 column is a factor variable, 2nd is other variable I need to count, 3 is logical variable as a constraint. (real database has more than 4mln observations)
dtf<-data.frame(c("a","a","b","c","b"),c("aaa","bbb","aaa","aaa","bbb"),c(TRUE,FALSE,TRUE,TRUE,TRUE))
colnames(dtf)<-c("factor","var","log")
dtf
factor var log
1 a aaa TRUE
2 a bbb FALSE
3 b aaa TRUE
4 c aaa TRUE
5 b bbb TRUE
So I need to find top absolute and relative frequencies of "var" where "log"==TRUE across each factor of "factor".
I've tried this with absolute frequencies (in real db I extract top 10, here I get 2 lines):
t1<-tapply(dtf$var[dtf$log==T],dtf$factor[dtf$log==T],function(x)(head(sort(table(x),decreasing=T),n=2L)))
# Returns array of lists: list of factors containing list of top frequencies
t2<-(t1, ldply)
# Split list inside by id and freq
t3<-do.call(rbind, lapply(t2, data.frame))
# Returns dataframe of top "var" values and corresponding freq for each group in "factor"
# Factor variable's labels are saved as row.names in t3
The following function helps to find relative frequency as for the whole database, not grouped by factors:
getrelfreq<-function(x){
v<-table(x)
v_rel<-v/nrow(dtf[dtf$log==T,])
head(sort(v_rel,decreasing=T),n=2L)}
But I have problems with relative frequencies as I need to divide the absolute frequency by number of rows of "var" BY EACH factor, not TOTAL nrow of "var" where "log"==T. I don't know how to use that in tapply loop such that the denominator will be different for each factor. I also would like to use both functions in 1 tapply loop instead of generating many tables and merging results. But have no idea how to put such 2 functions together.
If I understand you correctly you can do something like what I have written below. Use dcast
to get the frequencies of each var
across each factor
, then use rowSums()
to add them up to get absolute frequencies for each var across all factors. You can use prop.table
to work out the relative frequency of each var
across each factor
. Note I made a slight change to your example data so you can follow what is happening at each stage (I added a 'bbb'
value for factor
b
when log == TRUE
). Try this:
#Data frame (note 2 values for 'bbb' for factor 'b' when log == TRUE)
dtf<-data.frame(c("a","a","b","c","b","b"),c("aaa","bbb","aaa","aaa","bbb","bbb"),c(TRUE,FALSE,TRUE,TRUE,TRUE,TRUE))
colnames(dtf)<-c("factor","var","log")
dtf
# factor var log
#1 a aaa TRUE
#2 a bbb FALSE
#3 b aaa TRUE
#4 c aaa TRUE
#5 b bbb TRUE
#6 b bbb TRUE
library(reshape2)
# Find frequency of each var across each factor using dcast
mydat <- dcast( dtf[dtf$log==TRUE , ] , var ~ factor , sum )
# var a b c
#1 aaa 1 1 1
#2 bbb 0 2 0
# Use rowSums to find absolute frequency of each var across all groups
mydat$counts <- rowSums( mydat[,-1] )
# Order by decreasing frequency and just use first 10 rows
mydat[ order( mydat$counts , decreasing = TRUE ) , ]
# var a b c counts
#1 aaa 1 1 1 3
#2 bbb 0 2 0 2
# Relative proportions for each var across the factors
data.frame( var = mydat$var , round( prop.table( as.matrix( mydat[,-c(1,ncol(mydat))]) , 1 ) , 2 ) )
# var a b c
#1 aaa 0.33 0.33 0.33
#2 bbb 0.00 1.00 0.00