I have a dataset like the one below:
data <- data.frame(
Col1 = c("id1", "id2", "id3", "id4","id5", "id6", "id7", "id8"),
Col2 = c("A", "Bc", "A", "As", "As", "Bs", "A", "A"),
Col3 = c("BK", "AB", "BsC", "BX", "BK", "AsB", "BC", "BX"),
Col4 = c("CA", "XB", "CA", "SC", "CA", "CA", "CA", "SC"),
Col5 = c("Ao", "Bu", "Ai", "Ayy", "Ao", "Byu", "Aiy", "Ay"),
Col6 = c("Bc", "Bc", "Bc", "Bc", "Bc", "Bc", "Be", "Bd")
)
data
Col1 Col2 Col3 Col4 Col5 Col6
1 id1 A BK CA Ao Bc
2 id2 Bc AB XB Bu Bc
3 id3 A BsC CA Ai Bc
4 id4 As BX SC Ayy Bc
5 id5 As BK CA Ao Bc
6 id6 Bs AsB CA Byu Bc
7 id7 A BC CA Aiy Be
8 id8 A BX SC Ay Bd
If a category is over-represented in a column, that column needs to be omitted. For example, if the threshold is 0.74 (74%), the filtered data will drop Col6, because category Bc is over-represented there (6/8 = 75%). The filtered data will look like this:
Col1 Col2 Col3 Col4 Col5
1 id1 A BK CA Ao
2 id2 Bc AB XB Bu
3 id3 A BsC CA Ai
4 id4 As BX SC Ayy
5 id5 As BK CA Ao
6 id6 Bs AsB CA Byu
7 id7 A BC CA Aiy
8 id8 A BX SC Ay
Or, if the threshold is 60%, the filtered data will drop both Col4 and Col6, because category CA (in Col4) is over-represented (5/8 = 62.5%) and category Bc (in Col6) is over-represented (6/8 = 75%). The filtered data will look like this:
Col1 Col2 Col3 Col5
1 id1 A BK Ao
2 id2 Bc AB Bu
3 id3 A BsC Ai
4 id4 As BX Ayy
5 id5 As BK Ao
6 id6 Bs AsB Byu
7 id7 A BC Aiy
8 id8 A BX Ay
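The two percentages quoted above can be checked directly. This is only a small sketch: the vectors are copied from Col4 and Col6 of the example data, and `top_share` is a hypothetical helper name, not something from the original post.

```r
# Values copied from Col4 and Col6 of the example data
col4 <- c("CA", "XB", "CA", "SC", "CA", "CA", "CA", "SC")
col6 <- c("Bc", "Bc", "Bc", "Bc", "Bc", "Bc", "Be", "Bd")

# Hypothetical helper: share of the single most frequent category in a vector
top_share <- function(x) max(table(x)) / length(x)

top_share(col4)  # 0.625 (CA appears in 5 of 8 rows)
top_share(col6)  # 0.75  (Bc appears in 6 of 8 rows)
```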
Loop through the columns, get the table frequencies for each one, and check whether the largest proportion is smaller than the threshold:
x <- 0.74
# keep only the columns whose most frequent category has a share below x
data[ sapply(data, function(i) max(prop.table(table(i)))) < x ]
# Col1 Col2 Col3 Col4 Col5
# 1 id1 A BK CA Ao
# 2 id2 Bc AB XB Bu
# 3 id3 A BsC CA Ai
# 4 id4 As BX SC Ayy
# 5 id5 As BK CA Ao
# 6 id6 Bs AsB CA Byu
# 7 id7 A BC CA Aiy
# 8 id8 A BX SC Ay
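If the same filter is needed for several thresholds, the one-liner could be wrapped in a small function. The following is a sketch under assumptions, not part of the original answer: the name `drop_overrepresented` is made up for illustration, and `vapply` is swapped in for `sapply` only so that the result is guaranteed to be a numeric vector.

```r
data <- data.frame(
  Col1 = c("id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"),
  Col2 = c("A", "Bc", "A", "As", "As", "Bs", "A", "A"),
  Col3 = c("BK", "AB", "BsC", "BX", "BK", "AsB", "BC", "BX"),
  Col4 = c("CA", "XB", "CA", "SC", "CA", "CA", "CA", "SC"),
  Col5 = c("Ao", "Bu", "Ai", "Ayy", "Ao", "Byu", "Aiy", "Ay"),
  Col6 = c("Bc", "Bc", "Bc", "Bc", "Bc", "Bc", "Be", "Bd")
)

# Hypothetical helper: keep only the columns whose most frequent category
# makes up less than `threshold` of the values.
drop_overrepresented <- function(df, threshold) {
  # largest category share per column, as a named numeric vector
  top_share <- vapply(df, function(col) max(prop.table(table(col))), numeric(1))
  # drop = FALSE keeps the result a data frame even if one column remains
  df[, top_share < threshold, drop = FALSE]
}

filtered <- drop_overrepresented(data, 0.60)
names(filtered)
# "Col1" "Col2" "Col3" "Col5"
```

Two details are worth keeping in mind. The comparison is strict, so a column whose top share equals the threshold exactly is dropped. Also, `table()` ignores `NA` values by default, so the shares are computed over the non-missing entries of each column.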