Search code examples
runiquesubsetdata-manipulation

Is there a simple way to subset unique values and their associated datum?


I have a large data set that contains ~300,000 rows and 60 columns. Currently if I want to subset the unique characteristics within my one variable I use the unique() function to create a data.frame list of all the unique values in that variable. I then match it to the master data frame to get the associated data from my master file.

This process however is a little cumbersome, so I was wondering if there is a faster way to do the same thing? for example, is there a function that I could use to select for unique fields and the associated data that is connected to those values?

For example: I would like to make a new data frame the contains only unique SurveyID_Block ID and their associated island code and abundances.

 structure(list(SurveyID_Block = c("62003713_2", "62003087_2", 
"62003713_2", "62003713_2", "62003713_1", "62003713_2", "62003713_1", 
"62003713_2", "62003713_2", "62003087_1", "62003713_1", "62003713_1", 
"62003713_2", "62003713_2", "62003713_1", "62003087_1", "62003087_2", 
"62003713_2", "62003713_2", "62003713_2", "62003087_2", "62003713_2", 
"62003713_1", "62003713_1", "62003713_1", "62003713_1", "62003713_2", 
"62003713_1", "62003713_2", "62003087_1", "62003713_2", "62003087_1", 
"62003713_1", "62003087_2", "62003087_2", "62003713_2", "62003713_1", 
"62003087_1", "62003713_1", "62003713_1", "62003713_1", "62003087_2", 
"62003087_2", "62003713_2", "62003713_2", "62003713_2", "62003713_1", 
"62003087_1", "62003713_2", "62003087_2", "62003713_1", "62003713_1", 
"62003713_2", "62003713_1", "62003713_2", "62003087_2", "62003087_2", 
"62003087_1", "62003087_1", "62003713_1", "62003087_1", "62003087_1", 
"62003087_2", "62003087_2", "62003713_2", "62003713_1", "62003713_2", 
"62003713_2", "62003713_2", "62003713_1", "62003713_2", "62003087_1", 
"62003713_1", "62003713_1", "62003087_1", "62003087_1", "62003713_1", 
"62003087_2", "62003087_1", "62003087_2", "62003087_2", "62003087_1", 
"62003087_1", "62003087_1", "62003713_2", "62003087_2", "62003713_2", 
"62003087_2", "62003713_1", "62003713_1", "62003087_2", "62003087_1", 
"62003087_1", "62003087_1", "62003713_2", "62003713_2", "62003087_1", 
"62003713_1", "62003087_1", "62003087_2"), IslandCode = c(1391L, 
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 
1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L, 1391L
), totalAbun = c(667L, 174L, 667L, 667L, 715L, 667L, 715L, 667L, 
667L, 1365L, 715L, 715L, 667L, 667L, 715L, 1365L, 174L, 667L, 
667L, 667L, 174L, 667L, 715L, 715L, 715L, 715L, 667L, 715L, 667L, 
1365L, 667L, 1365L, 715L, 174L, 174L, 667L, 715L, 1365L, 715L, 
715L, 715L, 174L, 174L, 667L, 667L, 667L, 715L, 1365L, 667L, 
174L, 715L, 715L, 667L, 715L, 667L, 174L, 174L, 1365L, 1365L, 
715L, 1365L, 1365L, 174L, 174L, 667L, 715L, 667L, 667L, 667L, 
715L, 667L, 1365L, 715L, 715L, 1365L, 1365L, 715L, 174L, 1365L, 
174L, 174L, 1365L, 1365L, 1365L, 667L, 174L, 667L, 174L, 715L, 
715L, 174L, 1365L, 1365L, 1365L, 667L, 667L, 1365L, 715L, 1365L, 
174L)), .Names = c("SurveyID_Block", "IslandCode", "totalAbun"
), row.names = c(NA, 100L), class = "data.frame")

Solution

  • We can split the dataset by 'SurveyID_Block' to create a list of data.frames. It would be better to keep the datasets in a list rather than creating individual data.frame objects in the global environment.

    lst <- split(df1, df1$SurveyID_Block)
    

    But, if we need to create individual datasets, it can be done with list2env

    list2env(setNames(lst, paste0('dfN', seq_along(lst))),
              envir=.GlobalEnv)