Consider the following dataframe:
dummy_df <- tibble(
A=c("ABC", "ABC", "BCD", "CDF", "CDF", "CDF"),
B=c(0.25, 0.25, 1.23, 0.58, 0.58, 0.58),
C=c("lorem", "ipsum", "dolor", "amet", "something", "else"),
D=c("up", "up", "down", "down", "down", "down"),
E=c(132, 132, 243, 512, 512, 512),
F=c("m1", "m1", "m5", "m3", "m3", "m3"),
G=c("val", "val", "mur", "mad", "mad", "mad"),
H=c("grx", "grx", "bcn", "sal", "sal", "sal"),
I=c(1.68, 1.68, 2.31, 3.12, 3.12, 3.12),
J=c("p", "p", "f", "p", "p", "p"),
K=c(100, 100, 200, 143, 143, 143)
)
# A tibble: 6 × 11
A B C D E F G H I J K
<chr> <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
1 ABC 0.25 lorem up 132 m1 val grx 1.68 p 100
2 ABC 0.25 ipsum up 132 m1 val grx 1.68 p 100
3 BCD 1.23 dolor down 243 m5 mur bcn 2.31 f 200
4 CDF 0.58 amet down 512 m3 mad sal 3.12 p 143
5 CDF 0.58 something down 512 m3 mad sal 3.12 p 143
6 CDF 0.58 else down 512 m3 mad sal 3.12 p 143
After reading this, I have managed to collapse column C so that its value is concatenated into a string for every unique row value of column A.
dummy_df %>% group_by(A) %>% summarise(hits = toString(C), nhits=n())
# A tibble: 3 × 3
A hits nhits
<chr> <chr> <int>
1 ABC lorem, ipsum 2
2 BCD dolor 1
3 CDF amet, something, else 3
However, I'm loosing every other column information, which are essential to me. How can I retain information about all the columns while collapsing column C? Ideally it should be done without having to hard-code the columns names, since the number of columns can vary depending on the dataset.
I have read this, but the example shown doesn't create new variables so I haven't been able to make it work.
This is what I'm looking for:
# A tibble: 3 × 12
A hits nhits B D E F G H I J K
<chr> <chr> <int> <dbl> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
1 ABC lorem, ipsum 2 0.25 up 132 m1 val grx 1.68 p 100
2 BCD dolor 1 1.23 down 243 m5 mur bcn 2.31 f 200
3 CDF amet, something, else 3 0.58 down 512 m3 mad sal 3.12 p 143
If we assume that all fields other than C
will contain repeating values for each A
-group, then we can do this:
dummy_df %>%
group_by(A) %>%
summarize(
nhits = n(),
across(where(is.character), ~ toString(unique(.))),
across(where(~ !is.character(.)), ~ unique(.))
) %>%
ungroup()
# # A tibble: 3 × 12
# A nhits C D F G H J B E I K
# <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 ABC 2 lorem, ipsum up m1 val grx p 0.25 132 1.68 100
# 2 BCD 1 dolor down m5 mur bcn f 1.23 243 2.31 200
# 3 CDF 3 amet, something, else down m3 mad sal p 0.58 512 3.12 143
In this case, since all other variables are repeating, everything stays as-is. If, however, we have variability, then I've set up a catch, where strings are combined (with toString
) and differing numbers cause repeat rows.
dummy_df$H[1] <- "GRX"
dummy_df$K[1] <- 99
%>%
group_by(A) %>%
summarize(
nhits = n(),
across(where(is.character), ~ toString(unique(.))),
across(where(~ !is.character(.)), ~ unique(.))
) %>%
ungroup()
# A tibble: 4 × 12
A nhits C D F G H J B E I K
<chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 ABC 2 lorem, ipsum up m1 val GRX, grx p 0.25 132 1.68 99
2 ABC 2 lorem, ipsum up m1 val GRX, grx p 0.25 132 1.68 100
3 BCD 1 dolor down m5 mur bcn f 1.23 243 2.31 200
4 CDF 3 amet, something, else down m3 mad sal p 0.58 512 3.12 143
The across
function iterates over multiple columns. The where
function allows us to subset columns by something about their values, where for character
columns we will apply toString
, and for others we'll go with unique
. This means that string columns should never add rows, but non-strings may.
An alternative for the number-like columns would be to use first(.)
instead of unique(.)
, which will silently discard all but the first value in each column by group. Using the same modified data, we would see this, where the key difference is that K
has discarded the 100
values since the first value is 99
.
dummy_df %>%
group_by(A) %>%
summarize(
nhits = n(),
across(where(is.character), ~ toString(unique(.))),
across(where(~ !is.character(.)), ~ first(.))
) %>%
ungroup()
# # A tibble: 3 × 12
# A nhits C D F G H J B E I K
# <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 ABC 2 lorem, ipsum up m1 val GRX, grx p 0.25 132 1.68 99
# 2 BCD 1 dolor down m5 mur bcn f 1.23 243 2.31 200
# 3 CDF 3 amet, something, else down m3 mad sal p 0.58 512 3.12 143
You can choose to use first
with string columns as well, if you choose, which could simplify the logic to a single across
:
dummy_df %>%
group_by(A) %>%
summarize(
nhits = n(),
across(everything(), ~ first(.))
) %>%
ungroup()
# # A tibble: 3 × 12
# A nhits B C D E F G H I J K
# <chr> <int> <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
# 1 ABC 2 0.25 lorem up 132 m1 val GRX 1.68 p 99
# 2 BCD 1 1.23 dolor down 243 m5 mur bcn 2.31 f 200
# 3 CDF 3 0.58 amet down 512 m3 mad sal 3.12 p 143
Note that we've lost H
's "grx"
(since "GRX"
was first) as well as K
's value of 100
.
The data.table
variant of this last code-block can be simply:
library(data.table)
as.data.table(dummy_df)[, c(.(nhits = .N), lapply(.SD, first)), by = A]
and a base R rendition? Perhaps not as "pretty" :-)
Reduce(
function(a, b) merge(a, b, by = "A", all = TRUE),
list(
setNames(aggregate(C ~ A, dummy_df, FUN = length), c("A", "nhits")),
aggregate(C ~ A, dummy_df, FUN = toString),
aggregate(. ~ A, subset(dummy_df, select = -C), FUN = function(z) z[1])
)
)
# A nhits C B D E F G H I J K
# 1 ABC 2 lorem, ipsum 0.25 up 132 m1 val GRX 1.68 p 99
# 2 BCD 1 dolor 1.23 down 243 m5 mur bcn 2.31 f 200
# 3 CDF 3 amet, something, else 0.58 down 512 m3 mad sal 3.12 p 143