
dplyr eliminate character value in a string inside a column variable without leaving "" traces in the string


I have a dataset of observations on individuals that have clear group membership. Sometimes the "individuals" column of an observation also contains individuals with no group membership, mixed in with those that do. I would like to get rid of these individuals, but when I tried, I ended up with "" in place of their names. Could anyone help me remove their names without leaving any trace in the string that makes up the individuals column?

These individuals belong to the following groups:

groupA = c("Noir", "Bleue", "Rouge")  
groupB = c("Dion", "Saphir", "Chapman")  
groupC= c("Murray", "Nile", "Mississippi")  

My data looks like this:

group  date         time   individuals  
A      1/1/2016     9:00   "Noir", "Bleue", "Rouge"  
B      1/1/2016     9:00   "Dion", "Saphir", "Chapman"  
C      1/1/2016     9:00   "Murray", "Nile", "Mississippi"  

These cases are fine because every individual belongs to the group. Sometimes, however, extra individuals with no group membership are interspersed with the group members.

In the example below, three unknown individuals (InconnuA, InconnuB, Inconnu1) are mixed in:

group  date         time   individuals  
A      2/1/2016     9:00   "Noir", "Bleue", "InconnuA"  
B      2/1/2016     9:00   "Dion", "Saphir", "InconnuB"  
C      2/1/2016     9:00   "Murray", "Nile", Inconnu1"  

I would like to remove these individuals. The code below removes their names, but the resulting dataset has an unwanted "" wherever an unknown individual used to be.

IndividualsRemoved <- partycompfocal_GroupingID %>%  
  mutate(across("individuals", str_replace, "InconnuA", ""),  
         across("individuals", str_replace, "InconnuB", ""),  
         across("individuals", str_replace, "Inconnu1", ""),  
         across("individuals", str_replace, "Inconnu2", ""),  
         across("individuals", str_replace, "Inconnu3", ""))

After the change, my data looks like this:

 group  date         time   individuals  
 A      2/1/2016     9:00   "Noir", "Bleue", ""  
 B      2/1/2016     9:00   "Dion", "Saphir", ""  
 C      2/1/2016     9:00   "Murray", "Nile", "  

Could anyone help me remove the "" from the column individuals so it looks like this in the end?

group  date         time   individuals  
A      2/1/2016     9:00   "Noir", "Bleue"  
B      2/1/2016     9:00   "Dion", "Saphir"  
C      2/1/2016     9:00   "Murray", "Nile"  

Many thanks


Solution

  • From what I gather from the question, you are trying to remove from each string of individuals any names that are not part of the predefined group vectors, without leaving things like "" behind. There are a couple of approaches you could take:

    1. Turn the individuals column into a list column, if it isn't already, and then use map2() to keep only the names that appear in the corresponding group's member vector, or
    2. give each value in individuals its own row, using separate_longer_delim(), then use filter() to accomplish the same result.

    Below is the first option:

    library(tidyverse) 
    
    df <- tibble(
      group = c("A", "B", "C"),
      date = as.Date(c("1/1/2016", "1/1/2016", "1/1/2016"), format = "%d/%m/%Y"), # unclear if you are using day month year, or month day year
      time = hms(paste0(c("9:00", "9:00", "9:00"), ":00")),
      individuals = c('"Noir", "Bleue", "InconnuA"',
                      '"Dion", "Saphir", "InconnuB"',
                      '"Murray", "Nile", "Inconnu1"')) # note that each row's value is a string
    
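    # group membership vectors, as defined in the question
    groupA <- c("Noir", "Bleue", "Rouge")
    groupB <- c("Dion", "Saphir", "Chapman")
    groupC <- c("Murray", "Nile", "Mississippi")
    
    group_df <- tibble(group = c("A", "B", "C"), individuals = list(groupA, groupB, groupC))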
    
    df |> 
      # strip the leading and trailing double quotes, then split on '", "'
      mutate(individuals = str_extract(individuals, "(?<=^\").+(?=\"$)") |> str_split("\", \"")) |>
      left_join(group_df, by = "group") |>
      # keep only the names that belong to the row's group
      mutate(individuals = map2(individuals.x, individuals.y, ~ .x[.x %in% .y])) |>
      select(-individuals.x, -individuals.y)
    

    Output:

    # A tibble: 3 × 4
      group date       time     individuals
      <chr> <date>     <Period> <list>     
    1 A     2016-01-01 9H 0M 0S <chr [2]>  
    2 B     2016-01-01 9H 0M 0S <chr [2]>  
    3 C     2016-01-01 9H 0M 0S <chr [2]>
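
    The second option could be sketched roughly as follows. This is only a sketch under a few assumptions: each value of individuals is a single string, tidyr 1.3.0 or later is installed (for separate_longer_delim()), and the names obs, members and cleaned are made up for illustration. It also swaps in semi_join(), a filtering join, for the filter() step, and pastes the surviving names back into one quoted string per row, which is the format asked for in the question.

    ```r
    library(dplyr)
    library(tidyr)

    # Group membership vectors, as defined in the question
    groupA <- c("Noir", "Bleue", "Rouge")
    groupB <- c("Dion", "Saphir", "Chapman")
    groupC <- c("Murray", "Nile", "Mississippi")

    # Hypothetical lookup table in long form: one row per (group, member) pair
    members <- tibble(
      group = rep(c("A", "B", "C"),
                  times = c(length(groupA), length(groupB), length(groupC))),
      individual = c(groupA, groupB, groupC)
    )

    # Example data; date and time are kept as plain strings for simplicity
    obs <- tibble(
      group = c("A", "B", "C"),
      date = c("2/1/2016", "2/1/2016", "2/1/2016"),
      time = c("9:00", "9:00", "9:00"),
      individuals = c('"Noir", "Bleue", "InconnuA"',
                      '"Dion", "Saphir", "InconnuB"',
                      '"Murray", "Nile", "Inconnu1"')
    )

    cleaned <- obs |>
      # one row per individual
      separate_longer_delim(individuals, delim = ", ") |>
      # drop the double quotes around each name
      mutate(individual = gsub('"', "", individuals, fixed = TRUE)) |>
      # keep only rows whose (group, individual) pair is a known member
      semi_join(members, by = c("group", "individual")) |>
      # collapse back to one quoted, comma-separated string per observation
      group_by(group, date, time) |>
      summarise(individuals = paste0('"', individual, '"', collapse = ", "),
                .groups = "drop")

    cleaned$individuals
    # "\"Noir\", \"Bleue\""  "\"Dion\", \"Saphir\""  "\"Murray\", \"Nile\""
    ```

    One thing to keep in mind with this version: an observation in which none of the listed individuals belongs to the group disappears from the result entirely, because semi_join() leaves no rows to summarise. If such rows need to be kept, the list-column approach above may be the safer choice.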