Search code examples
rdplyrtidyversedata-manipulationtibble

Create new column based on presence/absence of string in other column by group


I have this dataset about vessels locations, where the same "id" can correspond to two levels. Corresponds to a defined category, such as "fishing" and may also appear as "unspecified". I would like to create a new column, where whenever an "id" appears as "unspecified" and also as another category (in other rows), "unspecified" is replaced by that category.

#dataset example
library(dplyr)

levels <- c("passenger", "passenger", "unspecified", "passenger", "passenger",
            "passenger", "passenger", "passenger", "passenger", "passenger",
            "unspecified", "passenger", "fishing", "unspecified", "fishing", 
            "fishing", "fishing","unspecified", "fishing", "fishing", 
            "unspecified","fishing", "fishing", "fishing", "unspecified",
            "unspecified", "unspecified")
id <- c("844", "844", "844", "844", "844","844", "844", "844", "844", "844",
        "844", "844", "845", "845", "845", "845", "845","845", "845", "845", 
        "845","845", "845", "845", "825", "825", "825")
lat <- c(-30.6456, -29.5648, -27.6667, -31.5587, -30.6934, -29.3147, -23.0538, 
         -26.5877, -26.6923, -23.40865, -23.1143, -23.28331, -31.6456, -24.5648, 
         -27.6867, -31.4587, -30.6784, -28.3447, -23.0466, -27.5877, -26.8524, 
         -23.8855, -24.1143, -23.5874, -23.5259, -22.8788, -22.1324)
long <- c(-50.4879, -49.8715, -51.8716, -50.4456, -50.9842, -51.9787, -41.2343, 
          -40.2859, -40.19599, -41.64302, -41.58042, -41.55057, -50.4576, -48.8715, 
          -51.4566, -51.4456, -50.4477, -50.9937, -41.4789, -41.3859, -40.2536, 
          -41.6502, -40.5442, -41.4057, -40.4058, -42.4877, -41.4545)

df <- tibble(levels = as.factor(levels), id  = as.factor(id), lat, long)

Here is my expected output:

> output %>% print(n = 27)
# A tibble: 27 x 5
   levels      id      lat  long new_colum  
   <fct>       <fct> <dbl> <dbl> <fct>      
 1 passenger   844   -30.6 -50.5 passenger  
 2 passenger   844   -29.6 -49.9 passenger  
 3 unspecified 844   -27.7 -51.9 passenger  
 4 passenger   844   -31.6 -50.4 passenger  
 5 passenger   844   -30.7 -51.0 passenger  
 6 passenger   844   -29.3 -52.0 passenger  
 7 passenger   844   -23.1 -41.2 passenger  
 8 passenger   844   -26.6 -40.3 passenger  
 9 passenger   844   -26.7 -40.2 passenger  
10 passenger   844   -23.4 -41.6 passenger  
11 unspecified 844   -23.1 -41.6 passenger  
12 passenger   844   -23.3 -41.6 passenger  
13 fishing     845   -31.6 -50.5 fishing    
14 unspecified 845   -24.6 -48.9 fishing    
15 fishing     845   -27.7 -51.5 fishing    
16 fishing     845   -31.5 -51.4 fishing    
17 fishing     845   -30.7 -50.4 fishing    
18 unspecified 845   -28.3 -51.0 fishing    
19 fishing     845   -23.0 -41.5 fishing    
20 fishing     845   -27.6 -41.4 fishing    
21 unspecified 845   -26.9 -40.3 fishing    
22 fishing     845   -23.9 -41.7 fishing    
23 fishing     845   -24.1 -40.5 fishing    
24 fishing     845   -23.6 -41.4 fishing    
25 unspecified 825   -23.5 -40.4 unspecified
26 unspecified 825   -22.9 -42.5 unspecified
27 unspecified 825   -22.1 -41.5 unspecified

It's for when the same id is unspecified but it's also belongs some other level. When it's just unspecified it stays that way, when it also belongs to other level replaces unspecified with this one.


Solution

  • You can replace "unspecified" with NA, fill these NA with the previous value in the same id group, and then replace the remaining NA back to "unspecified".

    library(tidyverse)
    
    df %>%
      mutate(new_colum = na_if(levels, "unspecified")) %>%
      group_by(id) %>% 
      fill(new_colum) %>%
      ungroup() %>%
      replace_na(list(new_colum = "unspecified"))
    
    # A tibble: 27 × 5
       levels      id      lat  long new_colum  
       <fct>       <fct> <dbl> <dbl> <fct>      
     1 passenger   844   -30.6 -50.5 passenger  
     2 passenger   844   -29.6 -49.9 passenger  
     3 unspecified 844   -27.7 -51.9 passenger  
     4 passenger   844   -31.6 -50.4 passenger  
     5 passenger   844   -30.7 -51.0 passenger  
     6 passenger   844   -29.3 -52.0 passenger  
     7 passenger   844   -23.1 -41.2 passenger  
     8 passenger   844   -26.6 -40.3 passenger  
     9 passenger   844   -26.7 -40.2 passenger  
    10 passenger   844   -23.4 -41.6 passenger  
    11 unspecified 844   -23.1 -41.6 passenger  
    12 passenger   844   -23.3 -41.6 passenger  
    13 fishing     845   -31.6 -50.5 fishing    
    14 unspecified 845   -24.6 -48.9 fishing    
    15 fishing     845   -27.7 -51.5 fishing    
    16 fishing     845   -31.5 -51.4 fishing    
    17 fishing     845   -30.7 -50.4 fishing    
    18 unspecified 845   -28.3 -51.0 fishing    
    19 fishing     845   -23.0 -41.5 fishing    
    20 fishing     845   -27.6 -41.4 fishing    
    21 unspecified 845   -26.9 -40.3 fishing    
    22 fishing     845   -23.9 -41.7 fishing    
    23 fishing     845   -24.1 -40.5 fishing    
    24 fishing     845   -23.6 -41.4 fishing    
    25 unspecified 825   -23.5 -40.4 unspecified
    26 unspecified 825   -22.9 -42.5 unspecified
    27 unspecified 825   -22.1 -41.5 unspecified