Selecting Entries in a Data Frame Stored in a List


I have the following datasets:

my_data = data.frame(col1 = c("abc", "bcd", "bfg", "eee", "eee") , id = 1:5)
my_data_1 = data.frame(col1 = c("abc", "byd", "bgg", "fef", "eee") , id = 1:5)

I defined an object as follows:

unique_vector = unique(my_data_1[c("col1"),])

I want to select all rows in "my_data" in which "col1" contains any value within "unique_vector":

output <- my_data[which(my_data$col1 %in% unique_vector ), ]

But this is returning an empty selection:

[1] col1 id  
<0 rows> (or 0-length row.names)

Is there another way to do this in R?

Thank you!

Note: I can get the result I want by typing the values out by hand:

> as.list(unique_vector)
$col1
[1] "abc" "byd" "bgg" "fef" "eee"
output <-  my_data[which(my_data$col1 %in% c("abc", "byd" ,"bgg",  "fef", "eee") ), ]

But I am looking for a shortcut in which I don't have to manually type out everything.


Solution

  • You are trying to subset one data.frame to rows that match the unique values of a column in another data.frame.

    Your attempt returns no rows because unique(my_data_1[c("col1")]) returns a data.frame, not a vector, and coercing it with as.list() only gives you a list, which still cannot be matched value by value. When subsetting with foo[bar, ], bar should be a vector: either the indices of the rows to keep (e.g. foo[c(1, 2), ]) or a logical vector with one value per row of the data.frame. All you need to do is use %in% with the vector of unique values itself.

    You don't need as.list() for this, and which() is redundant because a data.frame can be subset directly with a logical vector. %in% returns TRUE or FALSE for each row of my_data, and that logical vector can be used as the row index as it is. which() only converts it to the positions of the TRUE values before subsetting by position, which selects exactly the same rows.

    # Your example data
    my_data = data.frame(col1 = c("abc", "bcd", "bfg", "eee", "eee") , id = 1:5)
    my_data_1 = data.frame(col1 = c("abc", "byd", "bgg", "fef", "eee") , id = 1:5)
    unique = unique(my_data_1[c("col1")])
    
    # Show that unique is a data.frame
    str(unique)
    #> 'data.frame':    5 obs. of  1 variable:
    #>  $ col1: chr  "abc" "byd" "bgg" "fef" ...
    
    # Show that unique$col1 is a vector
    str(unique$col1)
    #>  chr [1:5] "abc" "byd" "bgg" "fef" "eee"
    
    # Show what a logical test with the character vector does
    my_data$col1 %in% unique$col1
    #> [1]  TRUE FALSE FALSE  TRUE  TRUE
    
    # We can use this to subset
    my_data[my_data$col1 %in% unique$col1, ]
    #>   col1 id
    #> 1  abc  1
    #> 4  eee  4
    #> 5  eee  5
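
    As a quick check of the point about which(), the sketch below subsets my_data both ways and compares the results. The names keep, idx, by_logical and by_index are only illustrative and do not appear in the question.

    ```r
    # Example data copied from the question
    my_data <- data.frame(col1 = c("abc", "bcd", "bfg", "eee", "eee"), id = 1:5)
    my_data_1 <- data.frame(col1 = c("abc", "byd", "bgg", "fef", "eee"), id = 1:5)

    # Logical vector: one TRUE/FALSE per row of my_data
    keep <- my_data$col1 %in% unique(my_data_1$col1)

    # which() turns the logical vector into the positions of the TRUE values
    idx <- which(keep)
    idx
    #> [1] 1 4 5

    # Subset once with the logical vector and once with the positions
    by_logical <- my_data[keep, ]
    by_index <- my_data[idx, ]

    # Both approaches select the same rows
    all.equal(by_logical, by_index)
    #> [1] TRUE
    ```

    Since %in% never returns NA, the logical vector is safe to use directly here.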
    

    You could also combine steps and simply use:

    my_data[my_data$col1 %in% unique(my_data_1$col1), ]
    #>   col1 id
    #> 1  abc  1
    #> 4  eee  4
    #> 5  eee  5
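
    To see why the data.frame version matches nothing, it may help to compare the two ways of pulling out the column. This is a minimal sketch using the question's data; as_df and as_vec are illustrative names, not part of the original code.

    ```r
    # Example data copied from the question
    my_data_1 <- data.frame(col1 = c("abc", "byd", "bgg", "fef", "eee"), id = 1:5)

    # Single brackets with a column name keep the data.frame structure
    as_df <- my_data_1["col1"]
    is.data.frame(as_df)
    #> [1] TRUE

    # [[ ]] (like $) returns the underlying character vector
    as_vec <- my_data_1[["col1"]]
    is.character(as_vec)
    #> [1] TRUE

    # %in% compares "abc" against the elements of its right-hand side.
    # For a data.frame those elements are whole columns, not the
    # individual values inside them, so nothing matches.
    "abc" %in% as_df
    #> [1] FALSE

    # Against the vector, the individual values are compared
    "abc" %in% as_vec
    #> [1] TRUE
    ```

    So any of my_data_1$col1, my_data_1[["col1"]] or my_data_1[, "col1"] should work on the right-hand side of %in%, while my_data_1["col1"] will not.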