
Remove column(s) with overrepresented categorical values


I have a dataset like the one below:

data <- data.frame(
  Col1 = c("id1", "id2", "id3", "id4","id5",  "id6", "id7", "id8"),
  Col2 = c("A", "Bc", "A", "As", "As", "Bs", "A", "A"),
  Col3 = c("BK", "AB", "BsC", "BX", "BK", "AsB", "BC", "BX"),
  Col4 = c("CA", "XB", "CA", "SC", "CA", "CA", "CA", "SC"),
  Col5 = c("Ao", "Bu", "Ai", "Ayy", "Ao", "Byu", "Aiy", "Ay"),
  Col6 = c("Bc", "Bc", "Bc", "Bc", "Bc", "Bc", "Be", "Bd")
)

data

  Col1 Col2 Col3 Col4 Col5 Col6
1  id1    A   BK   CA   Ao   Bc
2  id2   Bc   AB   XB   Bu   Bc
3  id3    A  BsC   CA   Ai   Bc
4  id4   As   BX   SC  Ayy   Bc
5  id5   As   BK   CA   Ao   Bc
6  id6   Bs  AsB   CA  Byu   Bc
7  id7    A   BC   CA  Aiy   Be
8  id8    A   BX   SC   Ay   Bd

A column should be dropped if any single category in it is over-represented, i.e. its share of the rows exceeds a threshold. For example, with a threshold of 0.74 (74%), Col6 is removed because category Bc is over-represented (6/8 = 75%). The filtered data would look like this:

  Col1 Col2 Col3 Col4 Col5
1  id1    A   BK   CA   Ao
2  id2   Bc   AB   XB   Bu
3  id3    A  BsC   CA   Ai
4  id4   As   BX   SC  Ayy
5  id5   As   BK   CA   Ao
6  id6   Bs  AsB   CA  Byu
7  id7    A   BC   CA  Aiy
8  id8    A   BX   SC   Ay

Or, with a threshold of 60%, both Col4 and Col6 are removed: CA in Col4 is over-represented (5/8 = 62.5%), and so is Bc in Col6 (6/8 = 75%). The filtered data would look like this:

  Col1 Col2 Col3 Col5
1  id1    A   BK   Ao
2  id2   Bc   AB   Bu
3  id3    A  BsC   Ai
4  id4   As   BX  Ayy
5  id5   As   BK   Ao
6  id6   Bs  AsB  Byu
7  id7    A   BC  Aiy
8  id8    A   BX   Ay
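
For reference, the proportions quoted above can be reproduced with a short base R check. This is only a sketch to verify the arithmetic; `top_share` is a name picked here for illustration.

```r
# Same example data as above
data <- data.frame(
  Col1 = c("id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"),
  Col2 = c("A", "Bc", "A", "As", "As", "Bs", "A", "A"),
  Col3 = c("BK", "AB", "BsC", "BX", "BK", "AsB", "BC", "BX"),
  Col4 = c("CA", "XB", "CA", "SC", "CA", "CA", "CA", "SC"),
  Col5 = c("Ao", "Bu", "Ai", "Ayy", "Ao", "Byu", "Aiy", "Ay"),
  Col6 = c("Bc", "Bc", "Bc", "Bc", "Bc", "Bc", "Be", "Bd")
)

# Share of the most frequent category in each column
top_share <- sapply(data, function(col) max(prop.table(table(col))))
top_share
# Col4 comes out at 0.625 (CA, 5/8) and Col6 at 0.75 (Bc, 6/8)
```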

Solution

  • Loop through the columns, get the frequency table of each, and check whether the largest proportion is smaller than the threshold:

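    # Threshold: a column is dropped when its most frequent value reaches this share
    x <- 0.74
    # Keep only the columns whose largest category proportion is below the threshold
    data[ sapply(data, function(i) max(prop.table(table(i)))) < x ]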
    #   Col1 Col2 Col3 Col4 Col5
    # 1  id1    A   BK   CA   Ao
    # 2  id2   Bc   AB   XB   Bu
    # 3  id3    A  BsC   CA   Ai
    # 4  id4   As   BX   SC  Ayy
    # 5  id5   As   BK   CA   Ao
    # 6  id6   Bs  AsB   CA  Byu
    # 7  id7    A   BC   CA  Aiy
    # 8  id8    A   BX   SC   Ay
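
The same idea can be wrapped in a small function so the 60% case from the question is easy to try as well. This is a minimal sketch, not a definitive implementation: `drop_overrepresented` is a hypothetical helper name, and the optional dplyr variant at the end assumes dplyr 1.0.0 or later (for `where()`) and is skipped if dplyr is not installed.

```r
# Same example data as in the question
data <- data.frame(
  Col1 = c("id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"),
  Col2 = c("A", "Bc", "A", "As", "As", "Bs", "A", "A"),
  Col3 = c("BK", "AB", "BsC", "BX", "BK", "AsB", "BC", "BX"),
  Col4 = c("CA", "XB", "CA", "SC", "CA", "CA", "CA", "SC"),
  Col5 = c("Ao", "Bu", "Ai", "Ayy", "Ao", "Byu", "Aiy", "Ay"),
  Col6 = c("Bc", "Bc", "Bc", "Bc", "Bc", "Bc", "Be", "Bd")
)

# Hypothetical helper (name chosen for this sketch): keep only the columns
# whose most frequent value accounts for less than `threshold` of the rows.
drop_overrepresented <- function(df, threshold) {
  top_share <- vapply(
    df,
    function(col) max(prop.table(table(col))),
    numeric(1)
  )
  # drop = FALSE keeps a data frame even if a single column survives
  df[, top_share < threshold, drop = FALSE]
}

res_74 <- drop_overrepresented(data, 0.74)  # drops Col6
res_60 <- drop_overrepresented(data, 0.60)  # drops Col4 and Col6
names(res_60)

# Optional dplyr variant (assumes dplyr >= 1.0.0 is installed)
if (requireNamespace("dplyr", quietly = TRUE)) {
  res_dplyr <- dplyr::select(
    data,
    where(function(col) max(prop.table(table(col))) < 0.60)
  )
}
```

Because the comparison is a strict `<`, a column whose top category sits exactly at the threshold is dropped too. Also note that `table()` ignores `NA` values by default, so the proportions are taken over the non-missing entries of each column.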