Tags: r, python-3.x, chi-squared

Different P values for chi square test in Python and R


I am trying to do a chi-squared test on two categories of biological data. I have a data frame like this:

             Brain  Cerebelum  Heart  Kidney  Liver  Testis
    expected     3         66      1      44     34      88
    observed     6         57      4      45     35      69

structure(list(Brain = c(3L, 6L), Cerebelum = c(66L, 57L), heart = c(1L, 
4L), kidney = 44:45, liver = 34:35, testis = c(88L, 69L)), .Names = c("Brain", 
"Cerebelum", "heart", "kidney", "liver", "testis"), class = "data.frame", row.names = c("rand", 
"cns"))

I did the test using Python:

from scipy.stats import chisquare
obs = [6, 57, 4, 45, 35, 69]
exp = [3, 66, 1, 44, 34, 88]
chisquare(obs, f_exp=exp)

which gives result as:

Power_divergenceResult(statistic=17.381684491978611, pvalue=0.0038300192430189722)

I tried to replicate the result in R, so I wrote the data to a CSV file, read it into R as a data frame, and ran:

d<-read.csv(file)
chisq.test(d)

Pearson's Chi-squared test

data:  d
X-squared = 4.9083, df = 5, p-value = 0.4272

Why are the chi-squared statistic and the p-value different in Python and R? When I calculate by hand using the simple (O-E)^2/E formula, I get a chi-squared value of 17.38, the same as Python, but I cannot figure out how R arrives at 4.9083.
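For reference, the hand calculation can be checked with a few lines of Python, using the same numbers as the table above:

```python
# the numbers from the question
obs = [6, 57, 4, 45, 35, 69]  # observed
exp = [3, 66, 1, 44, 34, 88]  # expected

# the hand formula: sum over cells of (O - E)^2 / E
stat = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
print(stat)  # 17.3816..., the same statistic scipy reports
```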


Solution

  • I can answer your first question.

    chisq.test, when you give it a matrix with at least two rows and columns, treats it as a two-dimensional contingency table and tests for independence between the row and column classifications.
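You can reproduce R's 4.9083 from Python as well: scipy's contingency-table test, like chisq.test on a matrix, derives the expected counts from the row and column totals. A minimal sketch with the question's numbers:

```python
from scipy.stats import chi2_contingency

# the question's data frame as a 2 x 6 table: row 1 = "rand", row 2 = "cns"
table = [[3, 66, 1, 44, 34, 88],
         [6, 57, 4, 45, 35, 69]]

# expected counts come from the row/column marginals, not from row 1
stat, p, df, expected = chi2_contingency(table)
print(round(stat, 4), df, round(p, 4))  # 4.9083 5 0.4272, same as R
```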

    scipy.stats.chisquare, on the other hand, just computes X = sum( (O_i - E_i)^2 / E_i ), the familiar definition of the test statistic.

    So how to square the circle? Pass R the observed values only, and give the expected probabilities in the argument p. Setting correct = FALSE also switches off R's default continuity correction, though, as the help text quoted below notes, that correction is only applied to 2-by-2 tables.

    e <- d[1, ]
    o <- d[2, ]
    chisq.test(o, p = e / sum(e), correct = FALSE)
    

    Voilà:

    Chi-squared test for given probabilities
    
    data:  o
    X-squared = 17.139, df = 5, p-value = 0.004243
    
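Note the remaining small gap: R reports 17.139 where scipy reported 17.382. That is because the expected counts sum to 236 while the observed counts sum to 216, and chisq.test turns p = e / sum(e) back into expected counts that total sum(o). Rescaling exp the same way makes scipy agree with R; a sketch reusing the question's numbers:

```python
from scipy.stats import chisquare

obs = [6, 57, 4, 45, 35, 69]
exp = [3, 66, 1, 44, 34, 88]

# chisq.test(o, p = e / sum(e)) uses expected counts p * sum(obs),
# i.e. exp rescaled so that it sums to the observed total
scale = sum(obs) / sum(exp)            # 216 / 236
exp_scaled = [e * scale for e in exp]

result = chisquare(obs, f_exp=exp_scaled)
print(round(result.statistic, 3))  # 17.139, matching R's X-squared
```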

    PS Tricky question for SO; possibly better suited to Cross Validated? Note that R's default correction may be a good thing versus scipy, but whether that is true is definitely one for Cross Validated.

    PPS The help in ?chisq.test is a little hard to parse, but I think this is all in there somewhere ;)

     If ‘x’ is a matrix with one row or column, or if ‘x’ is a vector
     and ‘y’ is not given, then a _goodness-of-fit test_ is performed
     (‘x’ is treated as a one-dimensional contingency table).  The
     entries of ‘x’ must be non-negative integers.  In this case, the
     hypothesis tested is whether the population probabilities equal
     those in ‘p’, or are all equal if ‘p’ is not given.
    
     If ‘x’ is a matrix with at least two rows and columns, it is taken
     as a two-dimensional contingency table: the entries of ‘x’ must be
     non-negative integers.  Otherwise, ‘x’ and ‘y’ must be vectors or
     factors of the same length; cases with missing values are removed,
     the objects are coerced to factors, and the contingency table is
     computed from these.  Then Pearson's chi-squared test is performed
     of the null hypothesis that the joint distribution of the cell
     counts in a 2-dimensional contingency table is the product of the
     row and column marginals.
    

    and

     correct: a logical indicating whether to apply continuity correction
              when computing the test statistic for 2 by 2 tables: one half
              is subtracted from all |O - E| differences; however, the
              correction will not be bigger than the differences
              themselves.  No correction is done if ‘simulate.p.value =
              TRUE’.