Image a dataframe looking like so:
Gene <- c("A", "A", "A", "A", "B", "B", "B", "C", "D", "D", "E")
c1 <- c(2, 2, 2, 3, 3, 3, 3, 4, 2, 4, 5)
df1 <- data.frame(Gene, c1)
head(df1, 10)
Gene c1
1 A 2
2 A 2
3 A 2
4 A 3
5 B 3
6 B 3
7 B 3
8 C 4
9 D 2
10 D 4
11 E 5
I want to do a chi-square test but need data to be in a different format for this. I want to make subsets of the data where I make 4 different datasets only containing genes A+B, A+C, A+D and A+E, as I want to statistically compare all to Gene A. To get for example A+B I do:
df2<- df1 [df1$Gene == "A" | df1$Gene == "B", ]
head(df2, 10)
Gene c1
1 A 2
2 A 2
3 A 2
4 A 3
3 B 3
4 B 3
5 B 3
I now need the data in the following format to do a Chi-square test:
A B
2 3 0
3 1 3
The first column is the numbers occurring previously in the c1 column. The columns below A and B contain the frequency with which those values occured (so A had three entries for 2 and one for 3; B had three entries for 3). Everything goes well until obtaining the subsets of the data, after which obtaining a frequency table in the right format is a problem.
Response to Andre Wildberg, sadly this doesnt work, I get the following:
Gene c1 Freq
1 A 2 3
2 B 2 0
3 A 3 1
4 B 3 3
The solution provided below works with the example I provided, not with my real data. If I create the frequency table with my real dataset the output I get is wrong. My real data is structured in the same way as the example, 2 columns, one with a groupname and one with counts for values 2, 3, 4 ...etc. I have multiple groups, after I do a simple filter step to create a new dataframe with only 2 groups (for the statistics) I perform your provided suggestions, the output table somehow again contains ALL the groups, even the ones I filtered out beforehand, the "Gene" and "c1" columns are still there, but the Gene column only has 1 row for each group, the c1 column only contains the value 2 and there is now a 3rd column added "freq" which contains how many times the value 2 occurred for each of the groups in the "Gene" column.
I have a clue why this is, I think my actual data is formatted slightly different then the example data I have provided. When I look at the example and real datasets with the "head" command the example data has a first column which doesn't have a column header but contains row numbers, my real data does NOT have such a column. When I do a View command and look at the data in there, both actually start with a non-name column with row numbers... When doing a head command the real data looks like this:
a tibble
Gene c1
A 2
A 2
A 2
A 3
B 3
B 3
B 3
C 4
D 2
You can subset the data and then pass it into table()
.
lapply(unique(df1$Gene[df1$Gene != 'A']),
\(x) table(subset(df1, Gene %in% c('A', x))))
# [[1]]
# c1
# Gene 2 3
# A 3 1
# B 0 3
#
# [[2]]
# c1
# Gene 2 3 4
# A 3 1 0
# C 0 0 1
#
# [[3]]
# ...
#
# [[4]]
# ...
If you need to do the Chi-squared test for each table, just use chisq.test()
for each table inside lapply()
.
lapply(unique(df1$Gene[df1$Gene != 'A']), \(x) {
tab <- table(subset(df1, Gene %in% c('A', x)))
chisq.test(tab)
})
[[1]]
Pearson's Chi-squared test with Yates' continuity correction
data: tab
X-squared = 1.4705, df = 1, p-value = 0.2253
[[2]]
...
[[3]]
...
[[4]]
...