
R: split variables from data frame and find unique ones


I have a tibble with 28 rows:

> al
# A tibble: 28 x 1
   lang_name                                               
   <chr>                                                   
 1 Objective-C,Swift,Other                                 
 2 Ruby,Shell                                              
 3 Ruby,HTML,Shell                                         
 4 Java,HTML,Kotlin,Other                                  
 5 TypeScript,JavaScript,CSS,Inno Setup,Shell,HTML         
 6 Vue,JavaScript,CSS,HTML                                 
 7 HTML,JavaScript,CSS                                     
 8 JavaScript,HTML,CSS,Other                               
 9 NA                                                      
10 Vim script,Ruby,Shell,Python,CoffeeScript,Makefile,Other
# ... with 18 more rows

Which I got by slicing another data frame with al <- gh[,'lang_name']. I want to extract the values from every row and place them all in a single list, so I can find the unique values.

How do I do that?

I have tried splitting with al <- str_split(al, ","), but it returns the following list:

[[1]]
  [1] "c(\"Objective-C"  "Swift"            "Other\""          " \"Ruby"         
  [5] "Shell\""          " \"Ruby"          "HTML"             "Shell\""         
  [9] " \"Java"          "HTML"             "Kotlin"           "Other\""         
 [13] " \"TypeScript"    "JavaScript"       "CSS"              "Inno Setup"      
 [17] "Shell"            "HTML\""           " \"Vue"           "JavaScript"      
 [21] "CSS"              "HTML\""           " \"HTML"          "JavaScript"      
 [25] "CSS\""            " \"JavaScript"    "HTML"             "CSS"             
 [29] "Other\""          " NA"              " \"Vim script"    "Ruby"            
 [33] "Shell"            "Python"           "CoffeeScript"     "Makefile"        
 [37] "Other\""          " \"PHP\""         " \"JavaScript"    "TypeScript"      
 [41] "Other\""          " \"JavaScript"    "Other\""          " \"JavaScript"   
 [45] "CSS"              "Shell\""          " \"Ruby"          "JavaScript"      
 [49] "HTML"             "Vue"              "CSS"              "Shell\""         
 [53] " \"Go"            "Assembly"         "HTML"             "C"               
 [57] "Shell"            "Perl\""           " \"Go"            "HCL"             
 [61] "Other\""          " \"JavaScript\""  " \"C++"           "JavaScript"      
 [65] "Python"           "Go"               "Shell"            "C\""             
 [69] " \n\"JavaScript"  "CSS"              "HTML"             "Other\""         
 [73] " \"C++"           "Cuda"             "C"                "CMake"           
 [77] "Java"             "Python"           "Other\""          " \"JavaScript"   
 [81] "GLSL\""           " \"JavaScript"    "TypeScript"       "CSS\""           
 [85] " \"Kotlin"        "C"                "Makefile"         "HTML"            
 [89] "C++"              "Java"             "Other\""          " \"Java"         
 [93] "Other\""          " \"Python"        "Jupyter Notebook" "C++"             
 [97] "HTML"             "Shell"            "JavaScript\""     " \"CSS"          
[101] "JavaScript"       "HTML"             "Other\""          " \"HTML"         
[105] "CSS"              "JavaScript\")"   

And unique(al) simply returns the same string.

I have also tried to put it all as a character:

al <- gh[1,'lang_name']
i = 2
# appends rows i + 1 onward, so row 2 is never added
while(i < nrow(gh)) {
    al <- paste(al, ",", gh[i+1,'lang_name'])
    i = i + 1
}

Which results in the following character: [1] "Objective-C,Swift,Other , Ruby,HTML,Shell , Java,HTML,Kotlin,Other , TypeScript,JavaScript,CSS,Inno Setup,Shell,HTML , Vue,JavaScript,CSS,HTML , HTML,JavaScript,CSS , JavaScript,HTML,CSS,Other , NA , Vim script,Ruby,Shell,Python,CoffeeScript,Makefile,Other , PHP , JavaScript,TypeScript,Other , JavaScript,Other , JavaScript,CSS,Shell , Ruby,JavaScript,HTML,Vue,CSS,Shell , Go,Assembly,HTML,C,Shell,Perl , Go,HCL,Other , JavaScript , C++,JavaScript,Python,Go,Shell,C , JavaScript,CSS,HTML,Other , C++,Cuda,C,CMake,Java,Python,Other , JavaScript,GLSL , JavaScript,TypeScript,CSS , Kotlin,C,Makefile,HTML,C++,Java,Other , Java,Other , Python,Jupyter Notebook,C++,HTML,Shell,JavaScript , CSS,JavaScript,HTML,Other , HTML,CSS,JavaScript"

Which I don't know how to convert into a vector of separate strings to run unique on.


Solution

  • If you like tidyverse/purrr functions, you can do this in one piped step. The key is to pass the column itself (al$lang_name) rather than the whole tibble: when given the tibble, str_split ends up working on a single deparsed string, which is why your result starts with "c(\"Objective-C". stringr::str_split is a convenient wrapper around stringi::stri_split and returns a list with one character vector per row. purrr::reduce applies a function, in this case c, repeatedly until that list is reduced to one character vector. unlist from base R also works well in place of reduce; I have very purrr-focused habits with tasks like this, but that doesn't need to be the default for a simple task.

    library(tidyverse)
    
    al$lang_name %>%
      str_split(",") %>%
      reduce(c) %>%
      unique()
    #>  [1] "Objective-C"  "Swift"        "Other"        "Ruby"        
    #>  [5] "Shell"        "HTML"         "Java"         "Kotlin"      
    #>  [9] "TypeScript"   "JavaScript"   "CSS"          "Inno Setup"  
    #> [13] "Vue"          NA             "Vim script"   "Python"      
    #> [17] "CoffeeScript" "Makefile"
    

    Created on 2018-06-03 by the reprex package (v0.2.0).
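
For comparison, here is a minimal base R sketch of the same idea, using strsplit and unlist instead of str_split and reduce. The small data frame is made up to stand in for the first rows of al, and dropping the NA at the end is an assumption about what you want; skip that line if NA should count as a value.

```r
# Hypothetical sample data standing in for the first rows of `al`
al <- data.frame(
  lang_name = c("Objective-C,Swift,Other", "Ruby,Shell", "Ruby,HTML,Shell", NA),
  stringsAsFactors = FALSE
)

# Split each row on commas; strsplit() returns a list with
# one character vector per row (an NA row stays NA)
parts <- strsplit(al$lang_name, ",", fixed = TRUE)

# Flatten the list into a single character vector and keep unique values
langs <- unique(unlist(parts))

# Assumption: missing rows are not wanted in the result
langs <- langs[!is.na(langs)]

print(langs)
```

If the real data has spaces around the commas, wrapping the flattened vector in trimws() before unique() avoids treating " Ruby" and "Ruby" as different values.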