I am trying to create a table with the Sample ID, raw counts, and Gene names.
In this table, a new row is created for each Sample Id to correspond to each gene name:
Sample ID | Gene A | Gene B |
---|---|---|
Sample 1 | 1 | - |
Sample 1 | - | 2 |
Sample 2 | 3 | - |
Sample 2 | - | 4 |
Rather than having numerous rows, I would like to condense them into a single row:
Sample ID | Gene A | Gene B |
---|---|---|
Sample 1 | 1 | 2 |
Sample 2 | 3 | 4 |
This is the code I have so far:
dfwide = data.wide.df %>% group_by(SampleId) %>%
summarise(Sample 1 = sum(Sample 1, na.rm = T),
Sample 2 = sum(Sample 2, na.rm = T))
I have over 1000 samples, so I was hoping to find a way to summarise all the genes at once. Any help would be appreciated!
If you are always guaranteed to have the same number of Gene A
as Gene B
, then this might work:
library(dplyr)
dat %>%
group_by(Sample.ID) %>%
summarize(across(starts_with("Gene"), ~ .[. != "-"]))
# # A tibble: 2 x 3
# Sample.ID Gene.A Gene.B
# <chr> <chr> <chr>
# 1 Sample 1 1 2
# 2 Sample 2 3 4
I've assumed that you have literal "-"
strings; if they are NA
or empty ""
, then that conditional can be modified to account for that.
The risk here is if there are uneven numbers of genes. For example, if the data were instead
dat2
# Sample.ID Gene.A Gene.B
# 1 Sample 1 1 -
# 2 Sample 1 - 2
# 5 Sample 1 - 3
# 3 Sample 2 3 -
# 4 Sample 2 - 4
dat2 %>%
group_by(Sample.ID) %>%
summarize(across(starts_with("Gene"), ~ .[. != "-"]))
# # A tibble: 3 x 3
# # Groups: Sample.ID [2]
# Sample.ID Gene.A Gene.B
# <chr> <chr> <chr>
# 1 Sample 1 1 2
# 2 Sample 1 1 3
# 3 Sample 2 3 4
You see how the 1
is repeated in multiple rows; this worked this time without error because of R's "recycling": since the number of valid strings in Gene.B
is a perfect multiple of the number of valid strings in Gene.A
, there is no complaint and the values are repeated. I consider recycling here to likely be improper, so perhaps not what you need.
If this is the case, it might be more appropriate to store this in "long" format:
dat %>%
tidyr::pivot_longer(-Sample.ID, names_to = "Gene", names_pattern = "Gene\\.(.*)", values_to = "Value") %>%
filter(Value != "-")
# # A tibble: 4 x 3
# Sample.ID Gene Value
# <chr> <chr> <chr>
# 1 Sample 1 A 1
# 2 Sample 1 B 2
# 3 Sample 2 A 3
# 4 Sample 2 B 4
dat2 %>%
tidyr::pivot_longer(-Sample.ID, names_to = "Gene", names_pattern = "Gene\\.(.*)", values_to = "Value") %>%
filter(Value != "-")
# # A tibble: 5 x 3
# Sample.ID Gene Value
# <chr> <chr> <chr>
# 1 Sample 1 A 1
# 2 Sample 1 B 2
# 3 Sample 1 B 3
# 4 Sample 2 A 3
# 5 Sample 2 B 4
which will likely require you to refactor downstream processing, but at least it is safe.
Data:
dat <- structure(list(Sample.ID = c("Sample 1", "Sample 1", "Sample 2", "Sample 2"), Gene.A = c("1", "-", "3", "-"), Gene.B = c("-", "2", "-", "4")), class = "data.frame", row.names = c(NA, -4L))
dat2 <- structure(list(Sample.ID = c("Sample 1", "Sample 1", "Sample 1", "Sample 2", "Sample 2"), Gene.A = c("1", "-", "-", "3", "-"), Gene.B = c("-", "2", "3", "-", "4")), row.names = c(1L, 2L, 5L, 3L, 4L), class = "data.frame")