Reformat dataframe to a frequency table for Chi-square test; with example

Image a dataframe looking like so:

Gene <- c("A", "A", "A", "A", "B", "B", "B", "C", "D", "D", "E")
c1 <- c(2, 2, 2, 3, 3, 3, 3, 4, 2, 4, 5)
df1 <- data.frame(Gene, c1)
head(df1, 10)
    Gene c1
1   A   2       
2   A   2
3   A   2
4   A   3
5   B   3       
6   B   3       
7   B   3       
8   C   4       
9   D   2       
10  D   4       
11  E   5

I want to do a chi-square test but need data to be in a different format for this. I want to make subsets of the data where I make 4 different datasets only containing genes A+B, A+C, A+D and A+E, as I want to statistically compare all to Gene A. To get for example A+B I do:

df2<- df1 [df1$Gene == "A" | df1$Gene == "B", ]
head(df2, 10)
 
    Gene c1
1   A   2       
2   A   2
3   A   2
4   A   3
3   B   3       
4   B   3       
5   B   3

I now need the data in the following format to do a Chi-square test:

    A   B   
2   3   0       
3   1   3

The first column is the numbers occurring previously in the c1 column. The columns below A and B contain the frequency with which those values occured (so A had three entries for 2 and one for 3; B had three entries for 3). Everything goes well until obtaining the subsets of the data, after which obtaining a frequency table in the right format is a problem.

Response to Andre Wildberg, sadly this doesnt work, I get the following:

     Gene     c1    Freq
1    A        2     3
2    B        2     0
3    A        3     1
4    B        3     3

The solution provided below works with the example I provided, not with my real data. If I create the frequency table with my real dataset the output I get is wrong. My real data is structured in the same way as the example, 2 columns, one with a groupname and one with counts for values 2, 3, 4 ...etc. I have multiple groups, after I do a simple filter step to create a new dataframe with only 2 groups (for the statistics) I perform your provided suggestions, the output table somehow again contains ALL the groups, even the ones I filtered out beforehand, the "Gene" and "c1" columns are still there, but the Gene column only has 1 row for each group, the c1 column only contains the value 2 and there is now a 3rd column added "freq" which contains how many times the value 2 occurred for each of the groups in the "Gene" column.

I have a clue why this is, I think my actual data is formatted slightly different then the example data I have provided. When I look at the example and real datasets with the "head" command the example data has a first column which doesn't have a column header but contains row numbers, my real data does NOT have such a column. When I do a View command and look at the data in there, both actually start with a non-name column with row numbers... When doing a head command the real data looks like this:

Solution

You can subset the data and then pass it into table().

lapply(unique(df1$Gene[df1$Gene != 'A']),
       \(x) table(subset(df1, Gene %in% c('A', x))))

# [[1]]
#     c1
# Gene 2 3
#    A 3 1
#    B 0 3
# 
# [[2]]
#     c1
# Gene 2 3 4
#    A 3 1 0
#    C 0 0 1
# 
# [[3]]
# ...
# 
# [[4]]
# ...

If you need to do the Chi-squared test for each table, just use chisq.test() for each table inside lapply().

lapply(unique(df1$Gene[df1$Gene != 'A']), \(x) {
  tab <- table(subset(df1, Gene %in% c('A', x)))
  chisq.test(tab)
})

[[1]]

    Pearson's Chi-squared test with Yates' continuity correction

data:  tab
X-squared = 1.4705, df = 1, p-value = 0.2253

[[2]]
...

[[3]]
...

[[4]]
...