I apologize if this is a very naive question...
I have 7000 2x4 contingency tables of count data. Each table represents a particular position in a genome and the number of times each DNA nucleotide is observed at that position in 2 different environments. An example contingency table would be:
             A   C   G      T
condition1   0   2  20  70000
condition2   3  15   0  95000
or
                 A   C   G   T
condition1   80146   0   5   0
condition2   26821   2   4   0
The data can only be non-negative integers. Minimum counts are 0 and the maximum can go up to ~800,000. One cell generally holds nearly all of the counts in its row, and it is the same nucleotide in both conditions (for example, T in the first table above and A in the second). Then 1 or 2 other cells will have low counts, and it is in these other cells that the difference, if any, should be observed.
The goal is to identify the positions that differ significantly between these 2 environmental conditions, for further analysis. Our measurement method is estimated to have an error rate of 10^-6.
I am using R to analyze this data. I am not sure I can run a chi-square test on it because some cells have small or 0 counts. With Fisher's exact test (fisher.test) I get 2 errors:
With a workspace of 1E5:
  FEXACT error 40.
  Out of workspace.
With a workspace of >3E5:
  FEXACT error 501.
  The hash table key cannot be computed because the largest key
  is larger than the largest representable int.
  The algorithm cannot proceed.
  Reduce the workspace size or use another algorithm.
Can anyone suggest an appropriate test, or settings for the Fisher or chi-square test?
Many thanks in advance,
Ron
The chi-square test runs on this table without error:
# First example table from the question, one row per condition
df1 <- data.frame(A = c(0L, 3L), C = c(2L, 15L), G = c(20L, 0L),
                  T = c(70000L, 95000L))
df1
  A  C  G     T
1 0  2 20 70000
2 3 15  0 95000
chisq.test(df1)
Pearson's Chi-squared test
data: df1
X-squared = 35.8943, df = 3, p-value = 7.884e-08
Warning message:
In chisq.test(df1) : Chi-squared approximation may be incorrect
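For context, R raises this warning when at least one expected count under independence is below 5. Below is a minimal sketch for checking which cells trigger it, rebuilding the first example table as a matrix. The names m, expected and small are illustrative and not from the original post.

```r
# Rebuild the first example table as a matrix (the name `m` is illustrative).
m <- matrix(c(0,  2, 20, 70000,
              3, 15,  0, 95000),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("condition1", "condition2"),
                            c("A", "C", "G", "T")))

# Expected counts under independence:
# row total * column total / grand total.
# suppressWarnings() only hides the approximation warning discussed above.
expected <- suppressWarnings(chisq.test(m))$expected
print(round(expected, 2))

# Cells whose expected count is below the conventional threshold of 5.
small <- expected < 5
print(small)
```

For this table only the two cells in column A fall below 5, so the warning comes from a single rare nucleotide rather than from the table as a whole.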
I am not sure whether this is sufficient, given the warning that the chi-squared approximation may be incorrect.
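One option to consider, which is a different technique from the asymptotic test above and has not been tried on the full data set, is a Monte Carlo p-value. Both chisq.test and fisher.test accept simulate.p.value = TRUE together with B, the number of simulated tables. The simulated Fisher test samples tables with fixed margins instead of enumerating them, so it should not depend on the FEXACT workspace. The sketch below assumes the first example table; the name m, the seed and B = 1e4 are arbitrary choices for illustration.

```r
# Rebuild the first example table as a matrix (the name `m` is illustrative).
m <- matrix(c(0,  2, 20, 70000,
              3, 15,  0, 95000),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("condition1", "condition2"),
                            c("A", "C", "G", "T")))

B <- 1e4       # number of simulated tables (assumption; increase as needed)
set.seed(1)    # arbitrary seed, only for reproducibility

# Chi-square statistic with a Monte Carlo p-value instead of the
# asymptotic chi-squared distribution.
mc_chisq <- chisq.test(m, simulate.p.value = TRUE, B = B)
print(mc_chisq$p.value)

# Fisher's test with a Monte Carlo p-value; no exact enumeration is done.
mc_fisher <- fisher.test(m, simulate.p.value = TRUE, B = B)
print(mc_fisher$p.value)

# A simulated p-value can never be smaller than 1 / (B + 1).
print(1 / (B + 1))
```

Because the simulated p-value is bounded below by 1 / (B + 1), B would need to be large if the 7000 p-values are later corrected for multiple testing.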