Search code examples
rstringunique

Count comma separated unique values in a string


The first two columns of dataframe make a composite key and there's a column of type char which contains comma separated integers. My objective is to make a column which contains the count of unique integers in the string. I know the approach of converting string to columns using str_split_fixed and then counting the unique values but due to the length of string a large number of columns are added and everything lags. Is there any other method? The actual data set contains 500k rows and 53 columns. Sample dataset :
df

c1      c2    c3  
aa      11   1,13,4,5,4,7,9    
bb      22   2,5,2,4,5,7,11,     
cc      33   11,14,3,1,    
dd      44   1,1,2,4,5,6,15,    
ee      55   4,3,3,1,14,17,

desired output:

c1        c2             c3             c4  
------ | ------   | ------          | -----   
aa     | 11       | 1,13,4,5,4,7,9  |  6    
------ | ------   | ------          | -----   
bb     | 22       | 2,5,2,4,5,7,11, |  5   
------ | ------   | ------          | -----   
cc     | 33       | 11,14,3,1,      |  4   
------ | ------   | ------          | -----   
dd     | 44       | 1,1,2,4,5,6,15, |  6       
------ | ------   | ------          | -----   
ee     | 55       | 4,3,3,1,7,17,7, |  5    
------ | ------   | ------          | -----  

Any help would be appreciated!


Solution

  • We can use stri_extract to extract all the numbers, then loop through the list, find the length of unique elements

    library(stringi)
    df1$Count <- sapply(stri_extract_all_regex(df1$col3, "[0-9]+"), 
                         function(x) length(unique(x)))