R - map vector of unique values to dataframe column with duplicates


I have a column in a dataframe that is a character vector. I would like to add to my dataframe a column containing unique ID values/codes corresponding to each unique value in said column. Here is some toy data:

fnames <- c("joey", "joey", "joey", "jimmy", "jimmy", "tommy", "michael", "michael", "michael", "michael", "michael", "kevin", "kevin", "christopher", "aaron", "joshua", "joshua", "joshua", "arvid", "aiden", "kentavious", "lawrence", "xavier")

names <- as.data.frame(fnames)

To get the number of unique values of fnames I run:

unique_fnames <- length(unique(names$fnames))

To generate unique IDs for each unique name, I found the following function:

create_unique_ids <- function(n, seed_no = 16169, char_len = 6){
  set.seed(seed_no)
  pool <- c(letters, LETTERS, 0:9)
  
  res <- character(n)
  for(i in seq(n)){
    this_res <- paste0(sample(pool, char_len, replace = TRUE), collapse = "")
    while(this_res %in% res){
      this_res <- paste0(sample(pool, char_len, replace = TRUE), collapse = "")
    }
    res[i] <- this_res
  }
  res
}

Applying create_unique_ids to unique_fnames I get the desired number of ID codes:

unique_fname_id <- create_unique_ids(unique_fnames)

My question is this:

How do I add the vector of unique_fname_id to my dataframe names? The desired result is a dataframe names with a unique_fname_id column that looks something like this:

unique_fname_id <- c("VvWMKt", "VvWMKt", "VvWMKt", "yEbpFq", "yEbpFq", "Z3xCdO"...)

where "VvWMKt" corresponds to "joey", "yEbpFq" corresponds to "jimmy" and so on. The dataframe names would be the same length as the original, just with this added column.

Is there a way to do this? All suggestions are welcome and appreciated. Thanks!

Edit: I need to keep the set.seed call in the create_unique_ids function so that the generated IDs are reproducible from run to run.


Solution

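  • If you want to use your function and keep the seed, generate one ID per distinct name and join the result back to the original data:

    library(dplyr)

    names %>% 
      distinct(fnames) %>% 
      # n() is the number of distinct names, so nothing is hard-coded
      mutate(unique_ID = create_unique_ids(n())) %>% 
      left_join(names, by = "fnames")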
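
The same mapping can be sketched in base R with match(), so no join is needed. This is only an illustration under the question's setup: it repeats the toy data and create_unique_ids() so that it runs on its own, and distinct_fnames and ids are names introduced here, not taken from the question. Each row receives the ID stored at the position where its name appears among the distinct names, so the row order of the data frame is left untouched.

```r
# Toy data and ID generator, copied from the question so the sketch is self-contained.
fnames <- c("joey", "joey", "joey", "jimmy", "jimmy", "tommy", "michael", "michael",
            "michael", "michael", "michael", "kevin", "kevin", "christopher", "aaron",
            "joshua", "joshua", "joshua", "arvid", "aiden", "kentavious", "lawrence",
            "xavier")
names <- data.frame(fnames = fnames, stringsAsFactors = FALSE)

create_unique_ids <- function(n, seed_no = 16169, char_len = 6) {
  set.seed(seed_no)
  pool <- c(letters, LETTERS, 0:9)
  res <- character(n)
  for (i in seq(n)) {
    this_res <- paste0(sample(pool, char_len, replace = TRUE), collapse = "")
    while (this_res %in% res) {
      this_res <- paste0(sample(pool, char_len, replace = TRUE), collapse = "")
    }
    res[i] <- this_res
  }
  res
}

# One ID per distinct name, in order of first appearance (hypothetical variable names).
distinct_fnames <- unique(names$fnames)
ids <- create_unique_ids(length(distinct_fnames))

# match() returns, for every row, the position of its name in distinct_fnames;
# indexing ids by those positions repeats each ID for every duplicate name.
names$unique_fname_id <- ids[match(names$fnames, distinct_fnames)]
```

Because the seed is still set inside create_unique_ids(), rerunning the sketch on the same data and R version should reproduce the same IDs.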
    

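    You can also remove the seed (the set.seed(seed_no) line and the seed_no parameter) from your function and have a simpler solution. The seed has to go for this to work: with it in place, every group would restart from the same random state and receive the same ID. Note that the IDs are then reproducible only if you call set.seed() once before the pipeline, and that uniqueness is no longer checked across groups, although a collision between 6-character IDs is very unlikely: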

    names %>% 
      group_by(fnames) %>% 
      mutate(unique_ID = create_unique_ids(1))
    
       fnames  unique_ID
       <chr>   <chr>    
     1 joey    ea10KC   
     2 joey    ea10KC   
     3 joey    ea10KC   
     4 jimmy   MD5W4d   
     5 jimmy   MD5W4d   
     6 tommy   xR7ozW   
     7 michael uuGn3h   
     8 michael uuGn3h   
     9 michael uuGn3h   
    10 michael uuGn3h   
    # ... with 13 more rows
    

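    You can also use a ready-made function such as stringi::stri_rand_strings (from the stringi package, not base R), which creates random alphanumeric strings with a fixed number of characters. As with the previous variant, uniqueness across groups is not checked: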

    library(stringi); library(dplyr)
    
    names %>% 
      group_by(fnames) %>% 
      mutate(unique_ID = stri_rand_strings(1, 6))
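
Since the group_by() variants draw one ID per group without comparing IDs across groups, a quick check that the mapping is one-to-one may be worthwhile. A minimal sketch in base R follows; is_one_to_one is a hypothetical helper name introduced here, not a function from dplyr or stringi.

```r
# Hypothetical helper: TRUE when every key has exactly one ID
# and no ID is shared by two different keys.
is_one_to_one <- function(keys, ids) {
  # Reduce to the distinct (key, id) pairs that occur in the data.
  pairs <- unique(data.frame(keys = keys, ids = ids, stringsAsFactors = FALSE))
  # anyDuplicated() returns 0 when there are no duplicates.
  anyDuplicated(pairs$keys) == 0 && anyDuplicated(pairs$ids) == 0
}

is_one_to_one(c("joey", "joey", "jimmy"), c("a1", "a1", "b2"))  # TRUE
is_one_to_one(c("joey", "joey", "jimmy"), c("a1", "a1", "a1"))  # FALSE: two names share an ID
is_one_to_one(c("joey", "joey"), c("a1", "b2"))                 # FALSE: one name has two IDs
```

For any of the results above, the call would be is_one_to_one(result$fnames, result$unique_ID), where result stands for whatever the pipeline returned.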