I have a dataframe that contains multiple observations for some individuals. Some observations are missing data, but it isn't always the same data, so I am trying to merge the observations to get one row for each individual that is as complete as possible.
Here is a toy dataframe that resembles what I am working with:
df<-data.frame(Ind=c("C","C","C","B","B"),
V1 = c("a",NA,"b","a",NA),
V2 = c("b","b",NA,"a","a"),
V3 = c(NA,"a","a",NA,"b"),
obs.id = c(1,2,3,4,5))
> df
Ind V1 V2 V3 obs.id
1 C a b <NA> 1
2 C <NA> b a 2
3 C b <NA> a 3
4 B a a <NA> 4
5 B <NA> a b 5
I have figured out a how to merge the data using summarise, like so:
> df%>%select_if(function(x) !all(is.na(x)))%>%
add_count(Ind)%>%filter(n>1)%>%
group_by(Ind)%>%
summarise(across(starts_with("V"),~max(., na.rm = TRUE)))
# A tibble: 2 × 4
Ind V1 V2 V3
<chr> <chr> <chr> <chr>
1 B a a b
2 C b b a
But you will notice that in df
Ind C has two different values for V1, but the summarise
command is just returning the "maximum" value, even though that isn't really a relevant descriptor in my data. This is bad and I need to know which individuals have non-identical variables in the different observations when one of them isn't NA. Tidyverse solutions are preferable, but I'm not picky about what the result looks like; it could be a vector of individuals with this problem, or a new column in df
, whatever. I just need to know which individuals have this issue and can't figure out how to test this.
If you just want to flag if any variable has more than one element
library(dplyr)
df %>%
mutate(flag = any(across(starts_with("V"), ~
length(na.omit(unique(.x))) > 1)), .by = Ind)
Ind V1 V2 V3 obs.id flag
1 C a b <NA> 1 TRUE
2 C <NA> b a 2 TRUE
3 C b <NA> a 3 TRUE
4 B a a <NA> 4 FALSE
5 B <NA> a b 5 FALSE
Or, if all variables should be flagged
library(dplyr)
df %>%
mutate(across(starts_with("V"), ~
length(na.omit(unique(.x))) > 1, .names="{.col}_flag"), .by = Ind)
Ind V1 V2 V3 obs.id V1_flag V2_flag V3_flag
1 C a b <NA> 1 TRUE FALSE FALSE
2 C <NA> b a 2 TRUE FALSE FALSE
3 C b <NA> a 3 TRUE FALSE FALSE
4 B a a <NA> 4 FALSE FALSE FALSE
5 B <NA> a b 5 FALSE FALSE FALSE
Getting all elements
library(dplyr)
df %>%
reframe(across(starts_with("V"), ~
toString(na.omit(unique(.x)))), .by = Ind)
Ind V1 V2 V3
1 C a, b b a
2 B a a b