Question 1) Let's say I have longitudinal data for 4 participants over 4 years i.e. years 0,1,3,4. My goal is to
data I have
ID visit n1 m1
1 0 5.6 0
1 1 1.5 NA
1 3 0.5 NA
1 4 NA NA
2 0 6 1
2 1 NA 0
2 3 NA 0
2 4 NA 0
3 0 3.4 0
3 1 2.4 0
3 3 2.5 0
3 4 1 1
4 0 NA NA
4 1 NA NA
4 3 NA NA
4 4 3.3 0
this is what I want
data 1
ID visit n1
1 0 5.6
1 1 1.5
1 3 0.5
1 4 NA
3 0 3.4
3 1 2.4
3 3 2.5
3 4 1
data2
ID visit m1
2 0 1
2 1 0
2 3 0
2 4 0
3 0 0
3 1 0
3 3 0
3 4 1
or this form where we create new variable n12 (0= <2 values present for n1 Vs. 1= >=2 values present for n1) & similarly m12. Later I can delete rows based on the values of these new variables n12 & m12.
ID visit n1 m1 n12 m12
1 0 5.6 0 1 0
1 1 1.5 NA 1 0
1 3 0.5 NA 1 0
1 4 NA NA 1 0
2 0 6 1 0 1
2 1 NA 0 0 1
2 3 NA 0 0 1
2 4 NA 0 0 1
3 0 3.4 0 1 1
3 1 2.4 0 1 1
3 3 2.5 0 1 1
3 4 1 1 1 1
4 0 NA NA 0 0
4 1 NA NA 0 0
4 3 NA NA 0 0
4 4 3.3 0 0 0
I tried Deleting incomplete cases across multiple rows in R studio but the following code gives me '0' observations in mydata as it will delete rows even if single NA is found across rows
mydata = mydata[!mydata$ID %in% mydata[!complete.cases(mydata) ,]$ID, ]
library(plyr)
# counts all the IDs
cnt = count(mydata, "ID")
# Eliminates any ID that doesn't have 2 observations
mydata[mydata$ID %in% cnt[cnt$freq == 2, ]$ID, ]
I also tried the long to wide format this did not work as I guess value in my case are multiple variables
library(dplyr)
mydata <- mydata %>%
tidyr::spread(key=time, value=value) %>% # reformat to wide
na.omit() %>% # delete cases with missingness on any variable (i.e. any time point)
tidyr::gather(key="time", value="value", -ID) # put it back in long format
New Question 2): how should I code if I want to have only rows with visit =0 value of n1 and at least one other visit (1/3/4) recording for n1? to get data like this:
ID visit n1
1 0 5.6
1 1 1.5
1 3 0.5
1 4 NA
3 0 3.4
3 1 2.4
3 3 2.5
3 4 1
Kindly suggest R syntax or method to achieve goals Thanks!
dat <- structure(list(
ID = c("1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3", "4", "4", "4", "4"),
visit = c("0", "1", "3", "4", "0", "1", "3", "4", "0", "1", "3", "4", "0", "1", "3","4"),
n1 = c("5.6", "1.5", "0.5", NA, "6", NA, NA, NA, "3.4","2.4", "2.5", "1", NA, NA, NA, "3.3"),
m1 = c("0", NA, NA, NA, "1", "0", "0", "0", "0", "0", "0", "1", NA, NA, NA, "0")),
row.names = 2:17, class = "data.frame")
library(dplyr)
dat %>%
group_by(ID) %>%
filter(sum(!is.na(n1)) >= 2) %>%
assign("data1", ., inherits = TRUE)
data1
#> # A tibble: 8 x 4
#> # Groups: ID [2]
#> ID visit n1 m1
#> <chr> <chr> <chr> <chr>
#> 1 1 0 5.6 0
#> 2 1 1 1.5 <NA>
#> 3 1 3 0.5 <NA>
#> 4 1 4 <NA> <NA>
#> 5 3 0 3.4 0
#> 6 3 1 2.4 0
#> 7 3 3 2.5 0
#> 8 3 4 1 1
dat %>%
group_by(ID) %>%
filter(sum(!is.na(m1)) >= 2) %>%
assign("data2", ., inherits = TRUE)
data2
#> # A tibble: 8 x 4
#> # Groups: ID [2]
#> ID visit n1 m1
#> <chr> <chr> <chr> <chr>
#> 1 2 0 6 1
#> 2 2 1 <NA> 0
#> 3 2 3 <NA> 0
#> 4 2 4 <NA> 0
#> 5 3 0 3.4 0
#> 6 3 1 2.4 0
#> 7 3 3 2.5 0
#> 8 3 4 1 1
dat %>%
group_by(ID) %>%
mutate(n12 = ifelse(sum(!is.na(n1)) >= 2, 1, 0)) %>%
mutate(m12 = ifelse(sum(!is.na(m1)) >= 2, 1, 0))
#> # A tibble: 16 x 6
#> # Groups: ID [4]
#> ID visit n1 m1 n12 m12
#> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 1 0 5.6 0 1 0
#> 2 1 1 1.5 <NA> 1 0
#> 3 1 3 0.5 <NA> 1 0
#> 4 1 4 <NA> <NA> 1 0
#> 5 2 0 6 1 0 1
#> 6 2 1 <NA> 0 0 1
#> 7 2 3 <NA> 0 0 1
#> 8 2 4 <NA> 0 0 1
#> 9 3 0 3.4 0 1 1
#> 10 3 1 2.4 0 1 1
#> 11 3 3 2.5 0 1 1
#> 12 3 4 1 1 1 1
#> 13 4 0 <NA> <NA> 0 0
#> 14 4 1 <NA> <NA> 0 0
#> 15 4 3 <NA> <NA> 0 0
#> 16 4 4 3.3 0 0 0
Created on 2021-12-16 by the reprex package (v2.0.1)
Demonstration of behavior:
# Let's look at one ID
# group_by() is essentially doing the same thing
# i.e., summarizing by group
id1 <- dat[dat$ID==1,]
id1
#> ID visit n1 m1
#> 2 1 0 5.6 0
#> 3 1 1 1.5 <NA>
#> 4 1 3 0.5 <NA>
#> 5 1 4 <NA> <NA>
# Test for NA in variable n1
is.na(id1$n1)
#> [1] FALSE FALSE FALSE TRUE
# Identify values that aren't NA
!is.na(id1$n1)
#> [1] TRUE TRUE TRUE FALSE
# Count values that aren't NA
sum(!is.na(id1$n1))
#> [1] 3
# For the first approach:
#
# Filter() keeps values that meet the criteria
# In this case, we are keeping IDs with more than 2 non-NAs in the focal variable (n1)
#
# Create object in the environment based on this dplyr chain
# named "data2";
# using data from dplyr chain ".";
# inherits=TRUE makes it available outside of the chain
assign(x="data1", value=., inherits=TRUE)
# For the second approach:
#
# If the number of non-NAs is greater than 2 (per your criteria),
# return a 1, otherwise 0, in a new column
# A result of 1 here indicates this ID has two or more values that are non-NA
ifelse(test=sum(!is.na(id1$n1)) >= 2, yes=1, no=0)
#> [1] 1