Search code examples
rtidyverseidentitymatching

cleaner way to match observations using IDs in different columns


I am using mark-recapture data, with each individual having a PIT tag (Tag) and a genetic sample ID given at the first observation (ID), which is typically used as the ID for the individual (Unified.ID). However, sometimes an animal will lose a PIT tag, and we have to retag it, but if we know its old tag number we record that as Alt_tag.

This sample data is four observations from the same individual, so I am trying to get them all to have the same Unified.ID by matching the Alt_tag in row 4 to the Tag in rows 1:3, so that all 4 rows end up with "CAL163037" in the Unified.ID column.

sample.dat<-tibble(ID=c("CAL163037",NA,NA,NA),
           Tag = c("132800","132800","132800","981930"),
           Date = c("2016-08-23","2017-06-09","2017-06-22","2017-08-23"),
           Alt_tag = c(NA,NA,NA,"132800"),
           Unified.ID = c("CAL163037","CAL163037","CAL163037","981930"))

# A tibble: 4 × 5
  ID        Tag    Date       Alt_tag Unified.ID
  <chr>     <chr>  <chr>      <chr>   <chr>     
1 CAL163037 132800 2016-08-23 NA      CAL163037 
2 NA        132800 2017-06-09 NA      CAL163037 
3 NA        132800 2017-06-22 NA      CAL163037 
4 NA        981930 2017-08-23 132800  981930  

I think I sort of have something close with the code below, but I feel like there is a more elegant way to do this, preferably with some kind of dplyr mutate, so it is easier to check that it is doing what I want it to (I can't get this if_else function working inside of mutate()). My full dataset is pretty big, so I want to be confident that it is working properly across the whole thing. Any ideas of how to clean this up?

> if_else(is.na(sample.dat$Alt_tag)==FALSE,(sample.dat[
which(sample.dat$Tag%in%sample.dat$Alt_tag),"Unified.ID"]%>%distinct()),NA)

# A tibble: 4 × 1
# Groups:   Unified.ID [1]
  Unified.ID
  <fct>     
1 CAL163037 
2 CAL163037 
3 CAL163037 
4 CAL163037 

Solution

  • Probably you can try this?

    sample.dat %>%
        mutate(Unified.Tag = coalesce(Alt_tag, Tag)) %>%
        mutate(Unified.ID = unique(na.omit(ID)), .by = Unified.Tag) %>%
        select(-Unified.Tag)
    

    which gives

    # A tibble: 4 × 5
      ID        Tag    Date       Alt_tag Unified.ID
      <chr>     <chr>  <chr>      <chr>   <chr>
    1 CAL163037 132800 2016-08-23 NA      CAL163037
    2 NA        132800 2017-06-09 NA      CAL163037
    3 NA        132800 2017-06-22 NA      CAL163037
    4 NA        981930 2017-08-23 132800  CAL163037