
How to transform a table with factors to a table with counts (in R)?


I am running GSEA analyses (with the clusterProfiler package) and want to perform a leading-edge analysis. For this I need to extract the raw data from a gseaResult object.

# FYI, my code looks like this:
GSEA_GO <- gseGO(geneList = gene_list, keyType = "SYMBOL", OrgDb = org.Hs.eg.db)
View(data.frame(GSEA_GO@result))
# After extraction and data transformation, this is a reprex of what I end up with:
# each letter is a gene name (included in the leading edge), and "GSx" is a gene set
GS1 <- c("a", "b", "c", "d", "e", "f")
GS2 <- c("b", "c", "d", "e", "f", "g")
GS3 <- c("a", "b", "c", NA, NA, NA)
GS4 <- c("a", "d", "e", "g", NA, NA)
GS5 <- c("a", "b", "c", "d", NA, NA)
df <- data.frame(rbind(GS1, GS2, GS3, GS4, GS5))

[image: table1]

To go further, I need to transform this table into another one in which each column represents the presence (= 1) or absence (= 0) of a gene in the gene set (i.e. in the row). It would look something like this:

[image: table2]

Of course, I have hundreds of genes and hundreds of gene sets, so I don't want to do everything by hand with ifelse() calls. Could anyone give me some pointers in the right direction? Thanks!


Solution

  • You will probably have an easier time defining your data in "long" format and then reshaping it to wide format, either with pivot_wider() or with the fastDummies package:

    library(tidyverse)
    
    GS1 <- c("a", "b", "c", "d", "e", "f") 
    GS2 <- c("b", "c", "d", "e", "f", "g") 
    GS3 <- c("a", "b", "c", NA,NA,NA) 
    GS4 <- c("a", "d", "e", "g", NA, NA) 
    GS5 <- c("a", "b", "c", "d", NA, NA) 
    
    df <- tibble(
      name = rep(c("GS1", "GS2", "GS3", "GS4", "GS5"), each = 6L),
      value = c(GS1, GS2, GS3, GS4, GS5)
    )
    
    df |> 
      filter(!is.na(value)) |> 
      fastDummies::dummy_cols("value", omit_colname_prefix = TRUE) |> 
      summarize(across(!value, sum), .by = name)
    #> # A tibble: 5 × 8
    #>   name      a     b     c     d     e     f     g
    #>   <chr> <int> <int> <int> <int> <int> <int> <int>
    #> 1 GS1       1     1     1     1     1     1     0
    #> 2 GS2       0     1     1     1     1     1     1
    #> 3 GS3       1     1     1     0     0     0     0
    #> 4 GS4       1     0     0     1     1     0     1
    #> 5 GS5       1     1     1     1     0     0     0
    

    Created on 2023-12-15 with reprex v2.0.2

    An even quicker way is to use table() to build a contingency table (and then convert it into a data frame if necessary):

    df |> 
      filter(!is.na(value)) |> 
      table() |> 
      as.data.frame.matrix()
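
The answer mentions pivot_wider() but only demonstrates fastDummies and table(). As a rough sketch of the pivot_wider() route, starting from the wide df in the question rather than a hand-built long tibble, something like the following should work. It assumes R >= 4.1 (for the native pipe) and reasonably recent dplyr/tidyr/tibble; the column names gene_set, position, gene and present, and the result name presence, are illustrative choices that do not appear in the original post.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Reprex data from the question: one row per gene set, padded with NA
GS1 <- c("a", "b", "c", "d", "e", "f")
GS2 <- c("b", "c", "d", "e", "f", "g")
GS3 <- c("a", "b", "c", NA, NA, NA)
GS4 <- c("a", "d", "e", "g", NA, NA)
GS5 <- c("a", "b", "c", "d", NA, NA)
df <- data.frame(rbind(GS1, GS2, GS3, GS4, GS5))

presence <- df |>
  # The gene-set names are stored as row names; move them into a column
  rownames_to_column("gene_set") |>
  # Long format: one row per (gene set, gene) pair
  pivot_longer(-gene_set, names_to = "position", values_to = "gene") |>
  select(-position) |>
  # Drop the NA padding and any repeated gene within a set
  filter(!is.na(gene)) |>
  distinct() |>
  # Mark every remaining pair as present, then spread genes into columns;
  # pairs that never occur are filled with 0
  mutate(present = 1L) |>
  pivot_wider(
    names_from = gene,
    values_from = present,
    values_fill = 0L,
    names_sort = TRUE
  )

presence
```

The call to distinct() is there so that a gene listed twice in the same set still yields 1 rather than a list-column; if counts are wanted instead, the table() approach above is probably the simpler choice.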