Given data such as:
A B C
1 Group 1 Group 2
2 Property 1 56 651
3 Property 2 97 1,380
how can one calculate the p-value (i.e., the "right-tail" probability of the chi-squared distribution) directly, without setting up a separate calculation for the expected values of the table?
The p-value is calculated in Excel by the function ChiSq.dist.RT
if you know the value of chi-squared for the table or by ChiSq.Test
if you know the table of "expected values" for the table. The value of chi-squared is calculated using the expected values, and the expected values are calculated from the original table by a somewhat complicated formula, so either way, Excel requires us to calculate the expected values ourselves in order to get the p-value, which seems kind-of silly. So, how can I get the p-value in Excel without calculating the expected values separately?
Edit: This question was originally posted with the title "How to calculate Pearson correlation coefficient with 2-property arrays?" and asked why the function pearson was giving the wrong answer. Well, the answer to that was that I was confusing p-value with the Pearson correlation coefficient, which are different things. So I've reformulated the question to ask what I really needed to know, and am posting the answer. I'll wait a while before accepting my own answer in case someone else has a better one.
It appears to me that this requires VBA. I've written the following VBA function to calculate either chi-squared or the p-value, as well as two other measures of association for a 2x2 contingency table:
Public Function nStatAssoc_2x2(sType As String, nGrp1PropCounts As Range, nGrp2PropCounts As Range) As Single
' Return one of several measures of statistical association of a 2×2 contingency table:
' Property 1 Property 2
' Group 1 nCount(1, 1) nCount(1, 2)
' Group 2 nCount(2, 1) nCount(2, 2)
' sType is: to calculate:
' "OR" Odds ratio
' "phi" Phi coefficient
' "chi-sq" Chi-squared
' "p" p-value, i.e., right-tailed probability of the chi-squared distribution
' nGrp<n>PropCounts is a range of two cells containing the number of members of group n that have each of two properties.
' These arguments are 1-D arrays in order to allow the data to appear in non-adjacent ranges in the spreadsheet.
' References:
' Contingency table: https://en.wikipedia.org/wiki/Contingency_table
' Measure of association: www.britannica.com/topic/measure-of-association
' Odds ratio: https://en.wikipedia.org/wiki/Odds_ratio
' https://en.wikipedia.org/wiki/Effect_size#Odds_ratio
' Phi coefficient: https://en.wikipedia.org/wiki/Phi_coefficient
' Chi-sq: https://en.wikipedia.org/wiki/Pearson's_chi-squared_test#Calculating_the_test-statistic
' www.mathsisfun.com/data/chi-square-test.html
' Shows calculation of expected values.
' p-value: https://learn.microsoft.com/en-us/office/vba/api/excel.worksheetfunction.ChiSq_Dist_RT
Dim nCount(1 To 2, 1 To 2) As Integer
Dim nSumGrp(1 To 2) As Integer, nSumProp(1 To 2) As Integer, nSumAll As Integer
Dim nExpect(1 To 2, 1 To 2) As Single
Dim nIndex1 As Byte, nIndex2 As Byte
Dim nRetVal As Single
' Combine input arguments into contingency table:
For nIndex1 = 1 To 2
nCount(1, nIndex1) = nGrp1PropCounts(nIndex1)
nCount(2, nIndex1) = nGrp2PropCounts(nIndex1)
Next nIndex1
' Calculate totals of group counts, property counts, and all counts (used for phi and chi-sq):
For nIndex1 = 1 To 2
For nIndex2 = 1 To 2
nSumGrp(nIndex1) = nSumGrp(nIndex1) + nCount(nIndex1, nIndex2)
nSumProp(nIndex2) = nSumProp(nIndex2) + nCount(nIndex1, nIndex2)
Next nIndex2
Next nIndex1
nSumAll = nSumGrp(1) + nSumGrp(2)
If nSumAll <> nSumProp(1) + nSumProp(2) Then
nRetVal = -2 ' Error: Sums differ.
GoTo Finished
End If
Select Case sType
' Odds ratio
Case "OR":
nRetVal = (nCount(1, 1) / nCount(1, 2)) / (nCount(2, 1) / nCount(2, 2))
If nRetVal <> (nCount(1, 1) / nCount(2, 1)) / (nCount(1, 2) / nCount(2, 2)) Then
nRetVal = -3 ' Error: OR calculation results differ.
GoTo Finished
End If
' Phi coefficient
Case "phi":
nRetVal = ((CLng(nCount(1, 1)) * nCount(2, 2)) - (CLng(nCount(1, 2)) * nCount(2, 1))) / _
(CSng(nSumGrp(1)) * nSumGrp(2) * nSumProp(1) * nSumProp(2)) ^ 0.5
' Chi-squared
Case "chi-sq", "p": ' For "p", nRetVal is passed to the next select case statement.
' Calculate table of expected values:
For nIndex1 = 1 To 2
For nIndex2 = 1 To 2
' In next line, the division is done first to prevent integer overflow,
' which can happen if the multiplication is done first.
nExpect(nIndex1, nIndex2) = nSumGrp(nIndex1) / nSumAll * nSumProp(nIndex2)
If nExpect(nIndex1, nIndex2) < 5 Then
' https://en.wikipedia.org/wiki/Pearson's_chi-squared_test#Assumptions
nRetVal = -4 ' Error: Expected value too small.
GoTo Finished
Else
nRetVal = nRetVal + _
(nCount(nIndex1, nIndex2) - nExpect(nIndex1, nIndex2)) ^ 2 / nExpect(nIndex1, nIndex2)
End If
Next nIndex2
Next nIndex1
Case Else:
nRetVal = -1 ' Error: Invalid measure type.
GoTo Finished
End Select
Select Case sType
Case "OR", "phi", "chi-sq":
' p-value ' Uses value of nRetVal passed from the previous select case statement.
Case "p": nRetVal = WorksheetFunction.ChiSq_Dist_RT(nRetVal, 1)
End Select
Finished: nStatAssoc_2x2 = nRetVal
End Function ' nStatAssoc_2x2()
The function is tested in Excel 2019 and yields correct values for all four measures for several test tables. Criticisms of or suggestions for improving the code are welcome.
If I'm wrong, and this doesn't require VBA or for any other reason there's a better way to do this, please post a different answer with that. As I said in the edit note in my question, I'll wait a while before accepting my answer to see if someone else has a better one.