I have a categoric variable for which I want to find the mode. In particular, I want to find the mode for the variable Ethic group, i.e. the most common ethnic group. I will do that grouping by household, i.e. finding the most common ethnic group in a household. The problem is, if there are more than one mode (more than one ethnic group are dominant one), then I would like to get the one from the household head (there is another variable that indicates whether the person is the household head).
Basically, we would have the table HQ2:
H_Code Rela_HH Ethn
1 AS-01 10 SEN
2 AS-01 1 SEN
3 AS-02 1 FA
4 AS-02 2 MA
5 AS-02 4 MA
6 AS-02 4 FA
7 AS-03 1 NZ
8 AS-03 2 MA
Where H_Code=Household code, Rela_HH=relation to household head (1 means household head), Ethn=ethnic group.
You will see that AS-02 has two modes: FA and MA. In this case, I want R to give me the one of the person with value=1 for Rela_HH.
So far, I have just managed to do a normal mode and can't find a way to code for the additional bit.
Mode <- function(x) { ux <- unique(x) ux[which.max(tabulate(match(, ux)))] } HQ2$c <- with(HQ2, ave(Ethn, H_Code, FUN=Mode))
I'm sure there's a good data.table
method to do this. I'll edit my answer if I come up with one. Right now I have an easy way to do this with order
and dplyr
. Note that the Mode
function will take the first Ethn
, so I reordered them in hq2 to demonstrate.
library(dplyr)
hq2 <- data.table(H_Code = c("AS-01", "AS-01", "AS-02", "AS-02", "AS-02", "AS-02", "AS-03", "AS-03"), Rela_HH = c(10,1,1,2,4,4,2,1), Ethn = c("SEN", "SEN", "FA", "MA", "MA", "FA", "NZ", "MA"))
hq2[order(H_Code, Rela_HH), ] %>%
group_by(H_Code) %>%
summarize(Ethn_Mode = Mode(Ethn))
# A tibble: 3 x 2
H_Code Ethn_Mode
<chr> <chr>
1 AS-01 SEN
2 AS-02 MA
3 AS-03 MA
If you had a big table, you could set your keys to H_Code
and Rela_HH
to make some quicker sorting.
library(data.table)
hq1 <- as.data.table(hq2)
hq1[order(H_Code, Rela_HH), ][, Mode(Ethn), by = list(H_Code)]
H_Code V1
1: AS-01 SEN
2: AS-02 MA
3: AS-03 MA
Edits
Here is the revised dplyr
code.
hq3 %>%
group_by(H_Code, Ethn) %>%
mutate(eth_count = sum(n())) %>%
mutate(priority = all(1 %in% Rela_HH & !1 %in% eth_count)) %>%
arrange(H_Code, desc(priority), desc(eth_count), Rela_HH) %>%
ungroup() %>%
group_by(H_Code) %>%
filter(row_number() %in% 1)
Here is the revised data.table
code, which does not use any mode
function. What it does is create a priority
column, which is set to TRUE
if any ethnicity occurs more than once, and an entry in that group has a 1
set in Rela_HH
. You then set your sorting order to work out the order and any ties (H_Code, -priority, -count, Rela_HH
).
Three datasets that have different scenarios, hq3
(HH ethnicity occurs >1, but < non-HH), hq4
(HH ethnicity occurs 1), and hq5
(HH ethnicity occurs >1, and > non-HH).
hq3 <- data.table(H_Code = c("AS-01", "AS-02", "AS-02", "AS-02", "AS-02", "AS-02", "AS-03", "AS-03"), Rela_HH = c(10,4,3,2,1,5,2,1), Ethn = c("SEN", "FA", "FA", "FA", "MA", "MA", "NZ", "MA"))
hq4 <- data.table(H_Code = c("AS-01", "AS-02", "AS-02", "AS-02", "AS-02", "AS-02", "AS-03", "AS-03"), Rela_HH = c(10,4,3,2,1,5,2,1), Ethn = c("SEN", "MA", "FA", "FA", "ZA", "MA", "NZ", "MA"))
hq5 <- data.table(H_Code = c("AS-01", "AS-02", "AS-02", "AS-02", "AS-02", "AS-02", "AS-03", "AS-03"), Rela_HH = c(10,4,3,2,1,5,2,1), Ethn = c("SEN", "MA", "FA", "FA", "MA", "MA", "NZ", "MA"))
hq3[, `:=` (count = .N), by = list(H_Code, Ethn)][, priority := all(1 %in% Rela_HH & !1 %in% count), by = list(H_Code, Ethn)][order(H_Code, Rela_HH, -count), ] # leave off the last [] section, it's here to show this output in order
H_Code Rela_HH Ethn count priority
1: AS-01 10 SEN 1 FALSE
2: AS-02 1 MA 2 TRUE
3: AS-02 2 FA 3 FALSE
4: AS-02 3 FA 3 FALSE
5: AS-02 4 FA 3 FALSE
6: AS-02 5 MA 2 TRUE
7: AS-03 1 MA 1 FALSE
8: AS-03 2 NZ 1 FALSE
hq3[order(H_Code, -priority, -count, Rela_HH), ][hq3[, .I[1], by = list(H_Code)]$V1]
H_Code Rela_HH Ethn count priority
1: AS-01 10 SEN 1 FALSE
2: AS-02 1 MA 2 TRUE
3: AS-03 1 MA 1 FALSE
Which one is faster? Data.table, roughly by a factor of 2, and with keys set on a big table, maybe more.
> microbenchmark(DTmeth(hq1), dplyrmeth(hq2), neval = 1000)
Unit: nanoseconds
expr min lq mean median uq max neval cld
DTmeth(hq1) 624865 641316 665616.17 654145.5 677087 847038 100 b
dplyrmeth(hq2) 1144980 1161583 1202274.35 1180147.5 1223617 1651814 100 c
neval 0 0 3.56 1.0 1 303 100 a
Revised code test (I'm curious if there's a better way for data.table
to do this since it's about equal with dplyr
now):
Unit: nanoseconds
expr min lq mean median uq max neval cld
DTmeth(hq5) 653542 683727.5 773249.70 706670.5 747120.5 1791880 100 b
DT2meth(hq5) 1670831 1816633.0 1947881.22 1874742.5 2040164.5 2892786 100 c
dplyrmeth(hq3) 1169733 1232672.5 1587836.22 1281876.5 1367757.5 24089844 100 c
dplyr2meth(hq3) 1414848 1506917.5 1903192.98 1541481.5 1631438.0 26743551 100 c
neval 0 1.0 18.88 1.0 1.0 303 100 a