Search code examples
runiquesubstr

Generating a table of unique substrings in R


So I have a very large dataset and I want to know the unique values for a column that has about 400,000 observations and each one looks like this: identifier:abzcd:def:RANDOMNUMBERSTRING and identifier:de:ghijklm:RANDOMNUMBERSTRING. I want only unique matches for the section before the random number string. In other words, I only want to filter out duplicates for the code: identifier:LETTERS:LETTERS The unique function doesn't work and it looks like I need to know exactly what substrings I want to filter or how long the substrings are to use the substr function. Any suggestions about how I can do this?

Below is some data that can be used as a model:

randz <- data.frame(id =
                      sprintf("identifier:%s%s%s:%s%s%s:%s",
                 sample(letters, 1000,replace = T ),
                 sample(letters, 1000,replace = T ),
                 sample(letters, 1000,replace = T ),
                 sample(letters, 1000,replace = T ),
                 sample(letters, 1000,replace = T ),
                 sample(letters, 1000,replace = T ),
                 sample(6000:7000, 1000, replace = T )))
randz

Solution

  • Here is one simple approach using the tidyverse

    # Fake Data
    randz <- data.frame(id =
                          sprintf("identifier:%s%s%s:%s%s%s:%s",
                     sample(letters, 1000,replace = T ),
                     sample(letters, 1000,replace = T ),
                     sample(letters, 1000,replace = T ),
                     sample(letters, 1000,replace = T ),
                     sample(letters, 1000,replace = T ),
                     sample(letters, 1000,replace = T ),
                     sample(6000:7000, 1000, replace = T )))
    

    Here I will use the str_remove function to remove the numbers (\d+) after the last colon (:) using the "$" to indicate the end of the string. Count will also pull each unique and the column "n" will indicate how many times it will appear.

    
    # Libraries
    library(tidyverse)
    randz %>% 
      mutate(out = str_remove(string = id,
                               pattern = ":\\d+$")) %>% 
      count(out,sort = TRUE)
    

    Output:

    A tibble: 1,000 x 2
       out                    n
       <chr>              <int>
     1 identifier:aar:muk     1
     2 identifier:abe:tlo     1
     3 identifier:abg:qux     1
     4 identifier:abh:bxx     1
     5 identifier:abl:vdj     1