Delete row of contingency table based on cell value


I have a data frame with approximately 20,000 observations. From this I've created a contingency table with frequencies of two variables.

I want to use this table to perform a chi-squared test of independence to see whether the two variables are related. Ordinarily this is easy, but many cells have expected values of 0 despite the large size of the original data frame, so I want to delete any rows that contain a frequency less than 5.

I've searched Stack Exchange extensively, but I can't find a solution to this specific problem that either a) I understand (I'm relatively new to R), or b) works with a contingency table rather than the original data frame.

Any help greatly appreciated.

Edit:

Thanks for your response Justin.

As requested, I've uploaded extracts of the dataframe and contingency table. I've also uploaded the small amount of code I've tried so far, with results.

Dataframe

Department Super
AAP     1
ACS     4
ACE     1
AMA     1
APS     3
APS     2
APS     1
APS     1
ARC     5
ARC     7
ARC     1
BIB     6
BIB     6
BMS     2

So there are two columns: the first is a three-letter department code and the second is a one-digit integer (1-7).

Contingency Table

table(department,super)

        1    2   3   4   5   6   7   8
ACS     32  10   7  24  50   7  24  14
AMA      0   4   2   6  10   3  11   1
...

So a standard contingency table with frequencies.

So far I know I can create a logical test that checks whether each cell's contents are greater than 5:

depSupCrosstab <- depSupCrosstab[,2:8]>5

What I don't know is how to use the matrix that this line of code creates to drop whole rows if they have any FALSE entries.

Hope that helps. I'm afraid I'm new at this, but there's only one way to learn...


Solution

  • I think I've found the answer in a related question. apply is your friend in this case, as it can iterate over either rows (MARGIN = 1) or columns (MARGIN = 2).

    To create a data frame analogous to yours and then select only the rows where every column is greater than 5, you can use the following:

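    set.seed(1985)
    # Simulated stand-in for the table: 10 rows x 8 columns of values from 0 to 100
    tosub <- data.frame(matrix(round(runif(n = 80, min = 0, max = 100)), ncol = 8))
    head(tosub, 2)
    # MARGIN = 1 applies all() to each row: TRUE only if every cell in the row is > 5
    x <- apply(tosub[, 1:8] > 5, MARGIN = 1, all)
    summary(x)
    # Keep only the rows that passed the test
    tosub[which(x), ]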
    
       X1 X2 X3 X4 X5 X6 X7 X8
    1  66 30 72 59 26 69 76 47
    2  27 42 26 95 66 14 67 18
    4  42 28 93  7 35 35 95 23
    5  38 89 69 91 98 91 60 69
    9  89 31 91 72 28 31 58 58
    10 53 87 27 89 95 37 98 20
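
The same idea should carry over to a table object directly, since apply works on any matrix-like object. Below is a minimal sketch under a few assumptions: the data frame dat and its counts are made up for illustration (they are not the asker's data), the object name depSupCrosstab is borrowed from the question, and the threshold is written as >= 5 to match "delete any rows that contain a frequency less than 5".

```r
# Hypothetical toy data standing in for the real 20,000-row data frame
dat <- data.frame(
  Department = rep(c("ACS", "AMA", "APS"), times = c(30, 12, 30)),
  Super = c(rep(1:3, each = 10),             # ACS: 10 observations per category
            rep(1:3, times = c(2, 4, 6)),    # AMA: 2, 4 and 6 observations
            rep(1:3, times = c(8, 9, 13)))   # APS: 8, 9 and 13 observations
)

# Contingency table of department by category
depSupCrosstab <- table(dat$Department, dat$Super)

# One logical value per row: TRUE only if every cell in that row is at least 5
keep <- apply(depSupCrosstab >= 5, MARGIN = 1, all)

# drop = FALSE keeps the result two-dimensional even if only one row survives
filtered <- depSupCrosstab[keep, , drop = FALSE]

filtered
chisq.test(filtered)
```

Note that the logical matrix is stored in a separate object (keep) rather than assigned back to depSupCrosstab, so the original counts are still available for subsetting. If the concern is expected rather than observed counts, chisq.test(filtered)$expected shows the expected values the test actually uses.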