So I have a very large dataset and I want to know the unique values for a column that has about 400,000 observations and each one looks like this:
identifier:abzcd:def:RANDOMNUMBERSTRING
and identifier:de:ghijklm:RANDOMNUMBERSTRING
.
I want only unique matches for the section before the random number string. In other words, I only want to filter out duplicates for the code: identifier:LETTERS:LETTERS
The unique
function doesn't work and it looks like I need to know exactly what substrings I want to filter or how long the substrings are to use the substr
function. Any suggestions about how I can do this?
Below is some data that can be used as a model:
randz <- data.frame(id =
sprintf("identifier:%s%s%s:%s%s%s:%s",
sample(letters, 1000,replace = T ),
sample(letters, 1000,replace = T ),
sample(letters, 1000,replace = T ),
sample(letters, 1000,replace = T ),
sample(letters, 1000,replace = T ),
sample(letters, 1000,replace = T ),
sample(6000:7000, 1000, replace = T )))
randz
Here is one simple approach using the tidyverse
# Fake Data
randz <- data.frame(id =
sprintf("identifier:%s%s%s:%s%s%s:%s",
sample(letters, 1000,replace = T ),
sample(letters, 1000,replace = T ),
sample(letters, 1000,replace = T ),
sample(letters, 1000,replace = T ),
sample(letters, 1000,replace = T ),
sample(letters, 1000,replace = T ),
sample(6000:7000, 1000, replace = T )))
Here I will use the str_remove
function to remove the numbers (\d+) after the last colon (:) using the "$" to indicate the end of the string. Count will also pull each unique and the column "n" will indicate how many times it will appear.
# Libraries
library(tidyverse)
randz %>%
mutate(out = str_remove(string = id,
pattern = ":\\d+$")) %>%
count(out,sort = TRUE)
Output:
A tibble: 1,000 x 2
out n
<chr> <int>
1 identifier:aar:muk 1
2 identifier:abe:tlo 1
3 identifier:abg:qux 1
4 identifier:abh:bxx 1
5 identifier:abl:vdj 1