Search code examples
rdplyr

Reformat dataframe to a frequency table for Chi-square test; with example


Image a dataframe looking like so:

Gene <- c("A", "A", "A", "A", "B", "B", "B", "C", "D", "D", "E")
c1 <- c(2, 2, 2, 3, 3, 3, 3, 4, 2, 4, 5)
df1 <- data.frame(Gene, c1)
head(df1, 10)
    Gene c1
1   A   2       
2   A   2
3   A   2
4   A   3
5   B   3       
6   B   3       
7   B   3       
8   C   4       
9   D   2       
10  D   4       
11  E   5

I want to do a chi-square test but need data to be in a different format for this. I want to make subsets of the data where I make 4 different datasets only containing genes A+B, A+C, A+D and A+E, as I want to statistically compare all to Gene A. To get for example A+B I do:

df2<- df1 [df1$Gene == "A" | df1$Gene == "B", ]
head(df2, 10)
 
    Gene c1
1   A   2       
2   A   2
3   A   2
4   A   3
3   B   3       
4   B   3       
5   B   3   

I now need the data in the following format to do a Chi-square test:

    A   B   
2   3   0       
3   1   3           

The first column is the numbers occurring previously in the c1 column. The columns below A and B contain the frequency with which those values occured (so A had three entries for 2 and one for 3; B had three entries for 3). Everything goes well until obtaining the subsets of the data, after which obtaining a frequency table in the right format is a problem.

Response to Andre Wildberg, sadly this doesnt work, I get the following:

     Gene     c1    Freq
1    A        2     3
2    B        2     0
3    A        3     1
4    B        3     3   

The solution provided below works with the example I provided, not with my real data. If I create the frequency table with my real dataset the output I get is wrong. My real data is structured in the same way as the example, 2 columns, one with a groupname and one with counts for values 2, 3, 4 ...etc. I have multiple groups, after I do a simple filter step to create a new dataframe with only 2 groups (for the statistics) I perform your provided suggestions, the output table somehow again contains ALL the groups, even the ones I filtered out beforehand, the "Gene" and "c1" columns are still there, but the Gene column only has 1 row for each group, the c1 column only contains the value 2 and there is now a 3rd column added "freq" which contains how many times the value 2 occurred for each of the groups in the "Gene" column.

I have a clue why this is, I think my actual data is formatted slightly different then the example data I have provided. When I look at the example and real datasets with the "head" command the example data has a first column which doesn't have a column header but contains row numbers, my real data does NOT have such a column. When I do a View command and look at the data in there, both actually start with a non-name column with row numbers... When doing a head command the real data looks like this:

a tibble 

Gene c1
A   2       
A   2
A   2
A   3
B   3       
B   3       
B   3       
C   4       
D   2       

Solution

  • You can subset the data and then pass it into table().

    lapply(unique(df1$Gene[df1$Gene != 'A']),
           \(x) table(subset(df1, Gene %in% c('A', x))))
    
    # [[1]]
    #     c1
    # Gene 2 3
    #    A 3 1
    #    B 0 3
    # 
    # [[2]]
    #     c1
    # Gene 2 3 4
    #    A 3 1 0
    #    C 0 0 1
    # 
    # [[3]]
    # ...
    # 
    # [[4]]
    # ...
    

    If you need to do the Chi-squared test for each table, just use chisq.test() for each table inside lapply().

    lapply(unique(df1$Gene[df1$Gene != 'A']), \(x) {
      tab <- table(subset(df1, Gene %in% c('A', x)))
      chisq.test(tab)
    })
    
    [[1]]
    
        Pearson's Chi-squared test with Yates' continuity correction
    
    data:  tab
    X-squared = 1.4705, df = 1, p-value = 0.2253
    
    [[2]]
    ...
    
    [[3]]
    ...
    
    [[4]]
    ...