Best way to count unique elements in a string in R


I'm still a beginner in R and I have a question!

I have a data frame of 222,000 observations and I'm interested in a specific column named id. The problem is that a single string can contain several ids separated by ',', and I want to count the unique elements in each string (that is, in each string of the original data frame). For example:

id                         results
0000001,0000003            2
0000002,0000002            1
0010001,0001006,0010001    2

I used the function 'str_split_fixed' to separate all the ids in each string and put the result in a new data frame (so now each cell holds either a single id or nothing). The problem is that there can be as many as 68 ',' in a string, so the new data frame is huge, with 68 columns and 220,000 observations, and it takes a long time (maybe 15 seconds). After that, I used an apply function to find the unique values.

Does anyone know a more efficient way, or have an idea?

Finally, I used the following code:

sapply(id, function(x) 
           length(    # count items
             unique(   # that are unique
                scan(   # when arguments are presented to scan as text 
                      text=x, what="", sep =",",  # when separated by ","
                      quiet=TRUE)))  )

But I get this error message:

Error in textConnection(text, encoding = "UTF-8") : 
  argument 'text' incorrect 
6 textConnection(text, encoding = "UTF-8") 
5 scan(text = x, what = "", sep = ",", quiet = TRUE) 
4 unique(scan(text = x, what = "", sep = ",", quiet = TRUE)) 
3 FUN(X[[i]], ...) 
2 lapply(X = X, FUN = FUN, ...) 
1 sapply(id, function(x) length(unique(scan(text = x, 
    what = "", sep = ",", quiet = TRUE)))) 

My R version is:

 R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)

locale:
[1] fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringr_1.0.0 plyr_1.8.3   

loaded via a namespace (and not attached):
[1] magrittr_1.5  tools_3.2.2   Rcpp_0.12.2   stringi_1.0-1
> 

I also tried Encoding(id) <- "UTF-8", but the result is:

Error in `Encoding<-`(`*tmp*`, value = "UTF-8")  

The output of dput(id) looks like this:

  [9987,] "2320212,2320230"
  [9988,] "4530090,4530917"
  [9989,] "8532412"
  [9990,] "4560292"
  [9991,] "4540375"
  [9992,] "3311324"
  [9993,] "4540030"
  [9994,] "9010000"
  [9995,] "2811810"
  [9996,] "3311000"
  [9997,] "4540030"
  [9998,] "4540215"
  [9999,] "1541201"
 [10000,] "2423810"
 [ reached getOption("max.print") -- omitted 90000 rows ]

The output is huge, so I am posting only the end and the first line:

 [9002,] "9460000"   

and for dput( head(data$id) ):

"9460000,9433000", "9460000,9436000", "9460000,9437000", 
"9510000", "9510010", "9510030", "9510090", "9910000", "9910020", 
"9910040", "9910090", "D", "FIELD_NOT_FOUND", "I"), class = "factor")  

Thanks in advance, Jef


Solution

sapply(id, function(x) 
           length(    # count items
             unique(   # that are unique
                scan(   # when arguments are presented to scan as text 
                      text=x, what="", sep =",",  # when separated by ","
                      quiet=TRUE)))  )
# --- result: first typed line is 'names' of the items, not the results.
    1 2,3,4   1,1 
    1     3     1 

The argument text=x lets scan accept a length-1 character element and split it into components at each occurrence of the separator given by sep. sapply passes the elements of the id vector to the anonymous function one at a time (or row by row if id were a column of a data frame).
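
The dput() output in the question ends with class = "factor", which suggests that id is a factor rather than a character vector. scan(text = ) expects character input, so that is a plausible cause of the textConnection error. The sketch below assumes this is the case and converts with as.character() first. The sample values are taken from the example table in the question. The variable names (id_chr, pieces, count_unique_scan, count_unique_split) are made up for illustration. The second half shows a possible alternative based on strsplit(), which splits every string in a single call instead of opening one text connection per element. It may be faster on 220,000 rows, but that has not been measured here.

```r
# Sample data from the question's example, stored as a factor
# (an assumption based on the class = "factor" in the dput() output).
id <- factor(c("0000001,0000003",
               "0000002,0000002",
               "0010001,0001006,0010001"))

# scan(text = ) needs character input, so convert the factor first.
id_chr <- as.character(id)

# Same approach as above, applied to the character vector.
count_unique_scan <- sapply(id_chr, function(x)
  length(unique(scan(text = x, what = "", sep = ",", quiet = TRUE))))

# Alternative sketch: split all strings in one call, then count the
# unique pieces of each. fixed = TRUE treats "," as a literal string.
pieces <- strsplit(id_chr, ",", fixed = TRUE)
count_unique_split <- vapply(pieces, function(p) length(unique(p)), integer(1))

print(unname(count_unique_scan))   # 2 1 2
print(count_unique_split)          # 2 1 2
```

With a data frame, the same idea would be applied to the column, for example data$id, and the result assigned to a new column.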