I am working with GSEA analyses (from the clusterProfiler package) and want to perform leading-edge analyses. For this I need to extract raw data from a gseaResult object.
#FYI my code looks like this:
GSEA_GO <- gseGO(geneList=gene_list, keyType = "SYMBOL", OrgDb = org.Hs.eg.db)
View(data.frame(GSEA_GO@result))
#after extraction and data transformation, this is a reprex of what I end with:
#one letter being a gene name (included in the leading edge), and "GSx" being a gene set
GS1 <- c("a", "b", "c", "d", "e", "f")
GS2 <- c("b", "c", "d", "e", "f", "g")
GS3 <- c("a", "b", "c", NA,NA,NA)
GS4 <- c("a", "d", "e", "g", NA, NA)
GS5 <- c("a", "b", "c", "d", NA, NA)
df <- data.frame(rbind(GS1, GS2, GS3, GS4, GS5))
To go further, I must transform this table into another one in which every column represents the presence (=1) or the absence (=0) of a gene in the gene set (i.e. in the row). The result would have one row per gene set and one column per gene (a to g here), filled with 0s and 1s.
Of course I have hundreds of genes and hundreds of gene sets, so I don't want to do everything by hand with ifelse() calls. Could anyone point me in the right direction? Thanks!
You will probably have an easier time defining your data in "long" format (one row per gene set and gene pair) and then reshaping it wider, either with pivot_wider() or with the fastDummies package:
library(tidyverse)
GS1 <- c("a", "b", "c", "d", "e", "f")
GS2 <- c("b", "c", "d", "e", "f", "g")
GS3 <- c("a", "b", "c", NA,NA,NA)
GS4 <- c("a", "d", "e", "g", NA, NA)
GS5 <- c("a", "b", "c", "d", NA, NA)
df <- tibble(
name = rep(c("GS1", "GS2", "GS3", "GS4", "GS5"), each = 6L),
value = c(GS1, GS2, GS3, GS4, GS5)
)
df |>
filter(!is.na(value)) |>
fastDummies::dummy_cols("value", omit_colname_prefix = TRUE) |>
summarize(across(!value, sum), .by = name)
#> # A tibble: 5 × 8
#> name a b c d e f g
#> <chr> <int> <int> <int> <int> <int> <int> <int>
#> 1 GS1 1 1 1 1 1 1 0
#> 2 GS2 0 1 1 1 1 1 1
#> 3 GS3 1 1 1 0 0 0 0
#> 4 GS4 1 0 0 1 1 0 1
#> 5 GS5 1 1 1 1 0 0 0
Created on 2023-12-15 with reprex v2.0.2
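For completeness, here is a rough sketch of the pivot_wider() route mentioned above, since the reprex only shows fastDummies. The object names `long` and `wide` and the helper column `present` are just illustrative, and the data is rebuilt with data.frame() so the block runs on its own:

```r
library(dplyr)
library(tidyr)

# Same long-format data as in the reprex above (one row per gene set / gene pair)
long <- data.frame(
  name  = rep(c("GS1", "GS2", "GS3", "GS4", "GS5"), each = 6L),
  value = c(
    "a", "b", "c", "d", "e", "f",
    "b", "c", "d", "e", "f", "g",
    "a", "b", "c", NA, NA, NA,
    "a", "d", "e", "g", NA, NA,
    "a", "b", "c", "d", NA, NA
  )
)

wide <- long |>
  filter(!is.na(value)) |>     # drop the NA padding
  distinct(name, value) |>     # guard against a gene listed twice in one set
  mutate(present = 1L) |>      # hypothetical helper column: 1 = gene is in the set
  pivot_wider(
    names_from  = value,       # one column per gene
    values_from = present,
    values_fill = 0L,          # genes absent from a set become 0
    names_sort  = TRUE         # keep gene columns in alphabetical order
  )

wide
```

This should give the same 0/1 layout as the fastDummies version, with one row per gene set.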
An even faster way is to use table() to get a quick contingency table (and then convert it into a data frame if necessary):
df |>
filter(!is.na(value)) |>
table() |>
as.data.frame.matrix()
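If you already have the wide, NA-padded data frame from the question rather than the long tibble, something along these lines might work in base R. It is only a sketch: `df_wide`, `m`, `long` and `presence` are names I made up here, and it assumes the row names of the data frame are the gene set names, as they are after rbind():

```r
# Wide, NA-padded table as built in the question
GS1 <- c("a", "b", "c", "d", "e", "f")
GS2 <- c("b", "c", "d", "e", "f", "g")
GS3 <- c("a", "b", "c", NA, NA, NA)
GS4 <- c("a", "d", "e", "g", NA, NA)
GS5 <- c("a", "b", "c", "d", NA, NA)
df_wide <- data.frame(rbind(GS1, GS2, GS3, GS4, GS5))

# Flatten to long format: row(m) gives the row index of every cell,
# so each gene is paired with the gene set (row name) it came from
m <- as.matrix(df_wide)
long <- data.frame(
  name  = rownames(m)[row(m)],
  value = as.vector(m)
)

# table() drops NA by default, so the padding disappears on its own
presence <- as.data.frame.matrix(table(long$name, long$value))
presence
```

If a gene could be listed more than once in the same gene set, table() would return counts above 1; in that case `(presence > 0) * 1` should turn them back into a 0/1 indicator matrix.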