Search code examples
rlongitudinal

Deleting/ rows if rows do not have atleast 2 recordings for multiple variable


Question 1) Let's say I have longitudinal data for 4 participants over 4 years i.e. years 0,1,3,4. My goal is to

  1. check if the data has outcome variables (n1) recorded for at least any 2 timepoints.
  2. If only one recording is present then delete; otherwise keep it.
  3. repeat 1) & 2) for multiple outcome variables (m1)

data I have

ID  visit   n1  m1
1   0   5.6 0
1   1   1.5 NA
1   3   0.5 NA
1   4   NA  NA
2   0   6   1
2   1   NA  0
2   3   NA  0
2   4   NA  0
3   0   3.4 0
3   1   2.4 0
3   3   2.5 0
3   4   1   1
4   0   NA  NA
4   1   NA  NA
4   3   NA  NA
4   4   3.3 0

this is what I want

data 1       
ID  visit   n1
1   0   5.6
1   1   1.5
1   3   0.5
1   4   NA
3   0   3.4
3   1   2.4
3   3   2.5
3   4   1

data2        
ID  visit   m1
2   0   1
2   1   0
2   3   0
2   4   0
3   0   0
3   1   0
3   3   0
3   4   1

or this form where we create new variable n12 (0= <2 values present for n1 Vs. 1= >=2 values present for n1) & similarly m12. Later I can delete rows based on the values of these new variables n12 & m12.

ID  visit   n1  m1  n12 m12
1    0   5.6 0  1   0
1    1   1.5 NA 1   0
1    3   0.5 NA 1   0
1    4   NA  NA 1   0
2    0   6   1  0   1
2    1   NA  0  0   1
2    3   NA  0  0   1
2    4   NA  0  0   1
3    0   3.4 0  1   1
3    1   2.4 0  1   1
3    3   2.5 0  1   1
3    4   1   1  1   1
4    0   NA  NA 0   0
4    1   NA  NA 0   0
4    3   NA  NA 0   0
4    4   3.3 0  0   0

I tried Deleting incomplete cases across multiple rows in R studio but the following code gives me '0' observations in mydata as it will delete rows even if single NA is found across rows

mydata = mydata[!mydata$ID %in% mydata[!complete.cases(mydata) ,]$ID, ]

library(plyr)
# counts all the IDs
cnt = count(mydata, "ID")
# Eliminates any ID that doesn't have 2 observations
mydata[mydata$ID %in% cnt[cnt$freq == 2, ]$ID, ]

I also tried the long to wide format this did not work as I guess value in my case are multiple variables

library(dplyr)    
mydata <- mydata %>%
tidyr::spread(key=time, value=value) %>% # reformat to wide
na.omit() %>% # delete cases with missingness on any variable (i.e. any time point)
tidyr::gather(key="time", value="value", -ID) # put it back in long format

New Question 2): how should I code if I want to have only rows with visit =0 value of n1 and at least one other visit (1/3/4) recording for n1? to get data like this:

ID  visit   n1  
1   0   5.6 
1   1   1.5 
1   3   0.5 
1   4   NA  
3   0   3.4 
3   1   2.4 
3   3   2.5 
3   4   1

Kindly suggest R syntax or method to achieve goals Thanks!


Solution

  • dat <- structure(list(
      ID = c("1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3", "4", "4", "4", "4"),
      visit = c("0", "1", "3", "4", "0", "1", "3", "4", "0", "1", "3", "4", "0", "1", "3","4"),
      n1 = c("5.6", "1.5", "0.5", NA, "6", NA, NA, NA, "3.4","2.4", "2.5", "1", NA, NA, NA, "3.3"),
      m1 = c("0", NA, NA, NA, "1", "0", "0", "0", "0", "0", "0", "1", NA, NA, NA, "0")),
      row.names = 2:17, class = "data.frame")
    
    library(dplyr)
    
    dat %>%
      group_by(ID) %>%
      filter(sum(!is.na(n1)) >= 2) %>%
      assign("data1", ., inherits = TRUE)
    data1
    #> # A tibble: 8 x 4
    #> # Groups:   ID [2]
    #>   ID    visit n1    m1   
    #>   <chr> <chr> <chr> <chr>
    #> 1 1     0     5.6   0    
    #> 2 1     1     1.5   <NA> 
    #> 3 1     3     0.5   <NA> 
    #> 4 1     4     <NA>  <NA> 
    #> 5 3     0     3.4   0    
    #> 6 3     1     2.4   0    
    #> 7 3     3     2.5   0    
    #> 8 3     4     1     1
    
    dat %>%
      group_by(ID) %>%
      filter(sum(!is.na(m1)) >= 2) %>%
      assign("data2", ., inherits = TRUE)
    data2
    #> # A tibble: 8 x 4
    #> # Groups:   ID [2]
    #>   ID    visit n1    m1   
    #>   <chr> <chr> <chr> <chr>
    #> 1 2     0     6     1    
    #> 2 2     1     <NA>  0    
    #> 3 2     3     <NA>  0    
    #> 4 2     4     <NA>  0    
    #> 5 3     0     3.4   0    
    #> 6 3     1     2.4   0    
    #> 7 3     3     2.5   0    
    #> 8 3     4     1     1
    
    dat %>%
      group_by(ID) %>%
      mutate(n12 = ifelse(sum(!is.na(n1)) >= 2, 1, 0)) %>%
      mutate(m12 = ifelse(sum(!is.na(m1)) >= 2, 1, 0))
    #> # A tibble: 16 x 6
    #> # Groups:   ID [4]
    #>    ID    visit n1    m1      n12   m12
    #>    <chr> <chr> <chr> <chr> <dbl> <dbl>
    #>  1 1     0     5.6   0         1     0
    #>  2 1     1     1.5   <NA>      1     0
    #>  3 1     3     0.5   <NA>      1     0
    #>  4 1     4     <NA>  <NA>      1     0
    #>  5 2     0     6     1         0     1
    #>  6 2     1     <NA>  0         0     1
    #>  7 2     3     <NA>  0         0     1
    #>  8 2     4     <NA>  0         0     1
    #>  9 3     0     3.4   0         1     1
    #> 10 3     1     2.4   0         1     1
    #> 11 3     3     2.5   0         1     1
    #> 12 3     4     1     1         1     1
    #> 13 4     0     <NA>  <NA>      0     0
    #> 14 4     1     <NA>  <NA>      0     0
    #> 15 4     3     <NA>  <NA>      0     0
    #> 16 4     4     3.3   0         0     0
    Created on 2021-12-16 by the reprex package (v2.0.1)
    

    Demonstration of behavior:

    # Let's look at one ID
    # group_by() is essentially doing the same thing
    # i.e., summarizing by group
    id1 <- dat[dat$ID==1,]
    id1
    #>   ID visit   n1   m1
    #> 2  1     0  5.6    0
    #> 3  1     1  1.5 <NA>
    #> 4  1     3  0.5 <NA>
    #> 5  1     4 <NA> <NA>
    
    # Test for NA in variable n1
    is.na(id1$n1)
    #> [1] FALSE FALSE FALSE  TRUE
    
    # Identify values that aren't NA
    !is.na(id1$n1)
    #> [1]  TRUE  TRUE  TRUE FALSE
    
    # Count values that aren't NA
    sum(!is.na(id1$n1))
    #> [1] 3
    
    # For the first approach:
    #
    # Filter() keeps values that meet the criteria
    # In this case, we are keeping IDs with more than 2 non-NAs in the focal variable (n1)
    #
    # Create object in the environment based on this dplyr chain
    # named "data2";
    # using data from dplyr chain ".";
    # inherits=TRUE makes it available outside of the chain
    assign(x="data1", value=., inherits=TRUE)
    
    # For the second approach:
    #
    # If the number of non-NAs is greater than 2 (per your criteria),
    # return a 1, otherwise 0, in a new column
    # A result of 1 here indicates this ID has two or more values that are non-NA
    ifelse(test=sum(!is.na(id1$n1)) >= 2, yes=1, no=0)
    #> [1] 1