Hi I need to concatenate strings by groups for multiple columns. I realise that versions of this question has been asked several times (see Aggregating by unique identifier and concatenating related values into a string), but they usually involve concatenating values of a single column.
My dataset is something like:
Sample group Gene1 Gene2 Gene3
A 1 a NA NA
A 2 b NA NA
B 1 NA c NA
C 1 a NA d
C 2 b NA e
C 3 c NA NA
I want to get it into a format where each samples takes only 1 row (the group column is optional):
Sample group Gene1 Gene2 Gene3
A 1,2 a,b NA NA
B 1 NA c NA
C 1,2,3 a,b,c NA d,e
Since the number of genes can go up to the thousands, I can't simply specify the columns that I wish to concatenate.
I know aggregate
or dplyr
can be used to get the groups but I can't figure out how to do it for multiple columns.
Thanks in advance!
As my dataset is very large containing thousands of genes, I realised dplyr is too slow. I've been experimenting with data.table and the following code can also get what I want:
setDT(df)[, lapply(.SD, function(x) paste(na.omit(x), collapse = ",")), by = Sample]
The output is now:
Sample group Gene1 Gene2 Gene3
1: A 1,2 a,b
2: B 1 c
3: C 1,2,3 a,b,c d,e
Thanks for all your help!
For these purposes, there are the summarise_all
, summarise_at
, and summarise_if
functions. Using summarise_all
:
df %>%
group_by(Sample) %>%
summarise_all(funs(paste(na.omit(.), collapse = ",")))
# A tibble: 3 × 5 Sample group Gene1 Gene2 Gene3 <chr> <chr> <chr> <chr> <chr> 1 A 1,2 a,b 2 B 1 c 3 C 1,2,3 a,b,c d,e
Update: In current versions of dplyr
, it is encouraged to combine summarise
with across
, e.g. like this:
df %>%
group_by(Sample) %>%
summarise(across(everything(), \(x) paste(na.omit(x), collapse = ",")))