Search code examples
excel-formulap-valuechi-squared

Excel statisticals: How to calculate p-value of a 2x2 contingency table?


Given data such as:

        A         B           C
1               Group 1     Group 2
2   Property 1     56         651
3   Property 2     97       1,380

how can one calculate the p-value (i.e., the "right-tail" probability of the chi-squared distribution) directly, without setting up a separate calculation for the expected values of the table?

The p-value is calculated in Excel by the function ChiSq.dist.RT if you know the value of chi-squared for the table or by ChiSq.Test if you know the table of "expected values" for the table. The value of chi-squared is calculated using the expected values, and the expected values are calculated from the original table by a somewhat complicated formula, so either way, Excel requires us to calculate the expected values ourselves in order to get the p-value, which seems kind-of silly. So, how can I get the p-value in Excel without calculating the expected values separately?

Edit: This question was originally posted with the title "How to calculate Pearson correlation coefficient with 2-property arrays?" and asked why the function pearson was giving the wrong answer. Well, the answer to that was that I was confusing p-value with the Pearson correlation coefficient, which are different things. So I've reformulated the question to ask what I really needed to know, and am posting the answer. I'll wait a while before accepting my own answer in case someone else has a better one.


Solution

  • It appears to me that this requires VBA. I've written the following VBA function to calculate either chi-squared or the p-value, as well as two other measures of association for a 2x2 contingency table:

    Public Function nStatAssoc_2x2(sType As String, nGrp1PropCounts As Range, nGrp2PropCounts As Range) As Single
    
    ' Return one of several measures of statistical association of a 2×2 contingency table:
    '                   Property 1      Property 2
    '       Group 1     nCount(1, 1)    nCount(1, 2)
    '       Group 2     nCount(2, 1)    nCount(2, 2)
    
    ' sType is:     to calculate:
    '   "OR"        Odds ratio
    '   "phi"       Phi coefficient
    '   "chi-sq"    Chi-squared
    '   "p"         p-value, i.e., right-tailed probability of the chi-squared distribution
    
    ' nGrp<n>PropCounts is a range of two cells containing the number of members of group n that have each of two properties.
    ' These arguments are 1-D arrays in order to allow the data to appear in non-adjacent ranges in the spreadsheet.
    
    ' References:
        ' Contingency table:        https://en.wikipedia.org/wiki/Contingency_table
        ' Measure of association:   www.britannica.com/topic/measure-of-association
        ' Odds ratio:               https://en.wikipedia.org/wiki/Odds_ratio
        '                           https://en.wikipedia.org/wiki/Effect_size#Odds_ratio
        ' Phi coefficient:          https://en.wikipedia.org/wiki/Phi_coefficient
        ' Chi-sq:                   https://en.wikipedia.org/wiki/Pearson's_chi-squared_test#Calculating_the_test-statistic
        '                           www.mathsisfun.com/data/chi-square-test.html
        '                               Shows calculation of expected values.
        ' p-value:                  https://learn.microsoft.com/en-us/office/vba/api/excel.worksheetfunction.ChiSq_Dist_RT
    
    Dim nCount(1 To 2, 1 To 2) As Integer
    Dim nSumGrp(1 To 2) As Integer, nSumProp(1 To 2) As Integer, nSumAll As Integer
    Dim nExpect(1 To 2, 1 To 2) As Single
    Dim nIndex1 As Byte, nIndex2 As Byte
    Dim nRetVal As Single
    
    ' Combine input arguments into contingency table:
    For nIndex1 = 1 To 2
        nCount(1, nIndex1) = nGrp1PropCounts(nIndex1)
        nCount(2, nIndex1) = nGrp2PropCounts(nIndex1)
      Next nIndex1
    
    ' Calculate totals of group counts, property counts, and all counts (used for phi and chi-sq):
    For nIndex1 = 1 To 2
        For nIndex2 = 1 To 2
            nSumGrp(nIndex1) = nSumGrp(nIndex1) + nCount(nIndex1, nIndex2)
            nSumProp(nIndex2) = nSumProp(nIndex2) + nCount(nIndex1, nIndex2)
          Next nIndex2
      Next nIndex1
    nSumAll = nSumGrp(1) + nSumGrp(2)
    
    If nSumAll <> nSumProp(1) + nSumProp(2) Then
        nRetVal = -2           ' Error: Sums differ.
        GoTo Finished
      End If
    
    Select Case sType
    
        ' Odds ratio
        Case "OR":
            nRetVal = (nCount(1, 1) / nCount(1, 2)) / (nCount(2, 1) / nCount(2, 2))
            If nRetVal <> (nCount(1, 1) / nCount(2, 1)) / (nCount(1, 2) / nCount(2, 2)) Then
                nRetVal = -3            ' Error: OR calculation results differ.
                GoTo Finished
              End If
    
        ' Phi coefficient
        Case "phi":
            nRetVal = ((CLng(nCount(1, 1)) * nCount(2, 2)) - (CLng(nCount(1, 2)) * nCount(2, 1))) / _
                        (CSng(nSumGrp(1)) * nSumGrp(2) * nSumProp(1) * nSumProp(2)) ^ 0.5
    
        ' Chi-squared
        Case "chi-sq", "p":     ' For "p", nRetVal is passed to the next select case statement.
            ' Calculate table of expected values:
            For nIndex1 = 1 To 2
                For nIndex2 = 1 To 2
                        ' In next line, the division is done first to prevent integer overflow,
                        '   which can happen if the multiplication is done first.
                    nExpect(nIndex1, nIndex2) = nSumGrp(nIndex1) / nSumAll * nSumProp(nIndex2)
                    If nExpect(nIndex1, nIndex2) < 5 Then
                        ' https://en.wikipedia.org/wiki/Pearson's_chi-squared_test#Assumptions
                        nRetVal = -4        ' Error: Expected value too small.
                        GoTo Finished
                      Else
                        nRetVal = nRetVal + _
                            (nCount(nIndex1, nIndex2) - nExpect(nIndex1, nIndex2)) ^ 2 / nExpect(nIndex1, nIndex2)
                      End If
                  Next nIndex2
              Next nIndex1
    
        Case Else:
            nRetVal = -1           ' Error: Invalid measure type.
            GoTo Finished
      End Select
    
    Select Case sType
        Case "OR", "phi", "chi-sq":
    
        ' p-value       ' Uses value of nRetVal passed from the previous select case statement.
        Case "p": nRetVal = WorksheetFunction.ChiSq_Dist_RT(nRetVal, 1)
      End Select
    
    Finished: nStatAssoc_2x2 = nRetVal
    
    End Function        ' nStatAssoc_2x2()
    

    The function is tested in Excel 2019 and yields correct values for all four measures for several test tables. Criticisms of or suggestions for improving the code are welcome.

    If I'm wrong, and this doesn't require VBA or for any other reason there's a better way to do this, please post a different answer with that. As I said in the edit note in my question, I'll wait a while before accepting my answer to see if someone else has a better one.