I'm fairly new to R and I have the following issue.
I have a dataframe like this:
A | B | C | E | F |G
1 02 XXX XXX XXX 1
1 02 XXX XXX XXX 1
2 02 XXX XXX XXX NA
2 02 XXX XXX XXX NA
3 02 XXX XXX XXX 1
3 Z1 XXX XXX XXX 1
4 02 XXX XXX XXX 2
....
M 02 XXX XXX XXX 1
The thing is that the dataframe possibly has 150k rows or more, and I need to generate another dataframe grouping by A
(which is an ID) and count the following occurrences:
When B is 02 and G has 1 <- V
When B is 02 and G is NA <- W
When B is Z1 and G has 1 <- X
When B is Z1 and G is NA <- Y
Any other kind of occurrence <- Z
For this simple example, the result should look something like this
A | V | W | X | Y | Z
1 2 0 0 0 0
2 0 2 0 0 0
3 1 1 0 0 0
4 0 0 0 0 1
...
M 1 0 0 0 0
At this point I managed to get the results using a for loop:
get_counters <- function(df){
counters <- data.frame(matrix(ncol = 6, nrow = length(unique(df$A))))
colnames(counters) <- c("A", "V", "W", "X", "Y", "Z")
counters$A<- unique(df$A)
for (i in 1:nrow(counters)) {
counters$V[i] <- sum(df$A == counters$A[i] & df$B == "02" & df$G == 1, na.rm = TRUE)
counters$W[i] <- sum(df$A == counters$A[i] & df$B == "02" & is.na(df$G), na.rm = TRUE)
counters$X[i] <- sum(df$A == counters$A[i] & df$B == "Z1" & df$G== 1, na.rm = TRUE)
counters$Y[i] <- sum(df$A == counters$A[i] & df$B == "Z1" & is.na(df$G), na.rm = TRUE)
counters$Z[i] <- sum(df$A == counters$A[i] & (df$B == "Z1" | df$B == "02") & df$G!= 1, na.rm = TRUE)
}
return(counters)
}
Trying that on a small test dataframe returns all the correct results, but with the real data is extremely slow. I'm not sure how to use the apply
functions, seems like a simple problem, but I have not found an answer. So far I've assumed that if I could use apply
with the sum
statement in my for loop (maybe using group_by(A)
) I could do it, but I receive all kind of errors.
counters$V <- df%>%
group_by(A)%>%
sum(df$A == counters$A& df$B == "02" &df$G == 1, na.rm = TRUE)
Error in FUN(X[[i]], ...) : only defined on a data frame with all numeric variables In addition: Warning message: In df$A== counters$A: longer object length is not a multiple of shorter object length
If I change the function to not use a for loop and not use $
(I get an error referring to "$ operator is invalid for atomic vectors"
) I either get more errors or weird unreadable results (Large lists that contain more values that the original dataframe, huge empty matrices, etc...)
Is there a simple (maybe not simple but fast and efficient) way to solve this problem? Thanks in advance.
You can do this very quickly using data.table.
Creating Dummy Data:
set.seed(123)
counters <- data.frame(A = rep(1:100000, each = 3), B = sample(c("02","Z1"), size = 300000, replace = T), G = sample(c(1,NA), size = 300000, replace = T))
All I am doing is counting the instances of the combination, then reshaping the data in the format you need:
library(data.table)
setDT(counters)
counters[,comb := paste0(B,"_",G)]
dcast(counters, A ~ comb, fun.aggregate = length, value.var = "A")
A 02_1 02_NA Z1_1 Z1_NA
1: 1 0 2 1 0
2: 2 1 0 1 1
3: 3 0 0 2 1
4: 4 1 1 0 1
5: 5 0 1 2 0
---
99996: 99996 0 1 1 1
99997: 99997 0 2 1 0
99998: 99998 2 0 1 0
99999: 99999 1 0 1 1
100000: 100000 0 2 0 1
I adopted a naming convention that is a bit more extensible (the new columns indicate what combination you are counting), but if you want to override, replace the comb :=
line with four lines like the following:
counters[B == "02" & is.na(G), comb := "V"]
counters[B == "02" & !is.na(G), comb := "X"]
....
But I think the above is a bit more flexible.