Add new value to table() in order to be able to use chi square test


From a single dataset I created two datasets by filtering on the target variable. Now I'd like to compare every feature across the two datasets using the chi-square test. The problem is that one of the two datasets is much smaller than the other, so for some features it is missing values that appear in the larger one, and when I try to apply the chi-square test I get this error: "all arguments must have the same length".

How can I add the missing values to the smaller dataset so that I can use the chi-square test?

Example:

I want to use the chi-square test on the same feature in the two datasets:

chisq.test(table(df1$var1, df2$var1))

but I get the error "all arguments must have the same length" because table(df1$var1) is:

a  b  c  d
2  5  7  18

while table(df2$var1) is:

a  b  c
8  1  12

So what I would like to do is add the value d to df2 with a count of 0, so that I can use the chi-square test.


Solution

  • The table output for df2 can be extended by converting var1 to a factor with the levels specified explicitly:

    table(factor(df2$var1, levels = letters[1:4]))
    
     a  b  c  d 
     8  1 12  0 
    

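    However, table() called with two vectors requires them to have the same length, because it cross-tabulates them element by element. The two datasets have different numbers of rows, which is what triggers the error. Instead, we can bind the datasets with a grouping column and then call table():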

    library(dplyr)
    table(bind_rows(df1, df2, .id = 'grp'))
       var1
    grp  a  b  c  d
      1  2  5  7 18
      2  8  1 12  0
    

    Or in base R

    table(data.frame(col1 = rep(1:2, c(nrow(df1), nrow(df2))), 
      col2 = c(df1$var1, df2$var1)))
        col2
    col1  a  b  c  d
       1  2  5  7 18
       2  8  1 12  0
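
The explicit `levels = letters[1:4]` above works for this example, but the levels can also be derived from the data. The following is a minimal sketch, assuming the feature is a character (or factor) column named `var1` in both data frames. The toy data is rebuilt with `rep()` to match the counts in the question, and the name `lv` is only illustrative.

```r
# Toy data matching the counts shown in the question
df1 <- data.frame(var1 = rep(c("a", "b", "c", "d"), times = c(2, 5, 7, 18)))
df2 <- data.frame(var1 = rep(c("a", "b", "c"), times = c(8, 1, 12)))

# Shared levels: every value that occurs in either dataset
lv <- sort(union(as.character(df1$var1), as.character(df2$var1)))

# Tabulate each dataset against the same levels, then stack the counts
# into a 2 x k matrix (values missing from a dataset get a count of 0)
tab <- rbind(
  df1 = table(factor(df1$var1, levels = lv)),
  df2 = table(factor(df2$var1, levels = lv))
)
tab
```

Because both rows are tabulated against the same levels, the columns line up even when one dataset lacks a value entirely.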
    

    data

    df1 <- structure(list(var1 = c("a", "a", "b", "b", "b", "b", "b", "c", 
    "c", "c", "c", "c", "c", "c", "d", "d", "d", "d", "d", "d", "d", 
    "d", "d", "d", "d", "d", "d", "d", "d", "d", "d", "d")), class = "data.frame", 
    row.names = c(NA, 
    -32L))
    
    df2 <- structure(list(var1 = c("a", "a", "a", "a", "a", "a", "a",
     "a", 
    "b", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c"
    )), class = "data.frame", row.names = c(NA, -21L))
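
Once the combined table exists, it can be passed to `chisq.test()` directly. The sketch below is one possible way to do that, not part of the original answer. It assumes the same `var1` column, rebuilds the data compactly with `rep()`, and uses a simulated p-value because some expected counts in this small example fall below 5, where the chi-square approximation is usually considered unreliable. The seed and `B = 2000` are arbitrary choices.

```r
# Toy data matching the counts shown in the question
df1 <- data.frame(var1 = rep(c("a", "b", "c", "d"), times = c(2, 5, 7, 18)))
df2 <- data.frame(var1 = rep(c("a", "b", "c"), times = c(8, 1, 12)))

# Stack both datasets and record which one each row came from
combined <- data.frame(
  grp  = rep(c("df1", "df2"), times = c(nrow(df1), nrow(df2))),
  var1 = c(as.character(df1$var1), as.character(df2$var1))
)

# 2 x 4 contingency table: group by value of var1
tab <- table(combined$grp, combined$var1)

# Expected counts under independence, to check how small they get
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Monte Carlo p-value instead of the asymptotic chi-square approximation
set.seed(1)
res <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
res
```

If a value had a count of 0 in both datasets, its expected counts would also be 0 and the statistic would be undefined, so such all-zero columns would need to be dropped before testing. For small tables like this one, `fisher.test(tab)` is another option.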