Generating a table of unique substrings in R

So I have a very large dataset and I want to know the unique values for a column that has about 400,000 observations and each one looks like this: identifier:abzcd:def:RANDOMNUMBERSTRING and identifier:de:ghijklm:RANDOMNUMBERSTRING. I want only unique matches for the section before the random number string. In other words, I only want to filter out duplicates for the code: identifier:LETTERS:LETTERS The unique function doesn't work and it looks like I need to know exactly what substrings I want to filter or how long the substrings are to use the substr function. Any suggestions about how I can do this?

Below is some data that can be used as a model:

randz <- data.frame(id =
                      sprintf("identifier:%s%s%s:%s%s%s:%s",
                 sample(letters, 1000,replace = T ),
                 sample(letters, 1000,replace = T ),
                 sample(letters, 1000,replace = T ),
                 sample(letters, 1000,replace = T ),
                 sample(letters, 1000,replace = T ),
                 sample(letters, 1000,replace = T ),
                 sample(6000:7000, 1000, replace = T )))
randz

Solution

Here is one simple approach using the tidyverse

# Fake Data
randz <- data.frame(id =
                      sprintf("identifier:%s%s%s:%s%s%s:%s",
                 sample(letters, 1000,replace = T ),
                 sample(letters, 1000,replace = T ),
                 sample(letters, 1000,replace = T ),
                 sample(letters, 1000,replace = T ),
                 sample(letters, 1000,replace = T ),
                 sample(letters, 1000,replace = T ),
                 sample(6000:7000, 1000, replace = T )))

Here I will use the str_remove function to remove the numbers (\d+) after the last colon (:) using the "$" to indicate the end of the string. Count will also pull each unique and the column "n" will indicate how many times it will appear.


# Libraries
library(tidyverse)
randz %>% 
  mutate(out = str_remove(string = id,
                           pattern = ":\\d+$")) %>% 
  count(out,sort = TRUE)

Output:

A tibble: 1,000 x 2
   out                    n
   <chr>              <int>
 1 identifier:aar:muk     1
 2 identifier:abe:tlo     1
 3 identifier:abg:qux     1
 4 identifier:abh:bxx     1
 5 identifier:abl:vdj     1