Tags: r, dataframe, loops, set-difference

Create a dataframe from the set difference between vectors in other dataframes in R


I have the following datasets and information: first, I have i different plots that I want to analyze. In each plot, I have j species for which I want to obtain some information, such as:

plot1 = c(rep(1, 3), rep(2, 4), rep(3, 5))
spp1 = c('a', 'b', 'c', 'a', 'b', 'c', 'd', 'b', 'b', 'b', 'e', 'f')
data.1 = data.frame(plot1, spp1)

The same information is repeated in a second dataframe with a similar structure:

plot2 = c(rep(1, 2), rep(2, 3), rep(3, 5))
spp2 = c('a', 'a', 'b', 'c', 'c', 'b', 'b', 'b', 'e', 'f')
data.2 = data.frame(plot2, spp2)

What I'm trying to do is, for each plot i, compute setdiff(unique(data.1$spp1), unique(data.2$spp2)) on the rows belonging to that plot, and add the result to a dataframe that has 2 columns: plot and spp_name.

For the example datasets I'd like to obtain a final dataframe such as:

df_result = data.frame(plot = c(1,1,2,2,3), spp_name = c('b','c','a','d',0))

0 (or similar) must be returned when setdiff(unique()) returns 'character(0)'. So, in a way, df_result needs to have, for each plot i, one row per species returned by setdiff() between data.1$spp1 and data.2$spp2.

The first thing I did was use a for loop over each plot i. Getting the setdiff() result is fine, but I don't know how to add this information to an empty dataframe. Do I need to loop over each species as well? I really hope my question is comprehensible.

Thanks in advance


Solution

  • You could use anti_join() from dplyr and then add a row for each plot that has no differing species:

    library(dplyr)
    
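    # keep the rows of data.1 whose (plot, species) pair does not occur in data.2
    anti_join(data.1, data.2, by = c("plot1" = "plot2", "spp1" = "spp2")) %>% 
      # add one row (spp1 = NA) for each plot that dropped out of the result
      add_row(plot1 = setdiff(data.1$plot1, .$plot1))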
    
    #  plot1 spp1
    #1     1    b
    #2     1    c
    #3     2    a
    #4     2    d
    #5     3 <NA>
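
If you would rather stay close to the per-plot loop described in the question, a base R sketch along these lines may also work. It assumes the corrected example data from the question; `diff_for_plot` is a hypothetical helper name introduced here for illustration, and NA is used as the placeholder instead of 0 so that the species column stays character.

```r
# Example data from the question (with the typos fixed)
data.1 <- data.frame(
  plot1 = c(rep(1, 3), rep(2, 4), rep(3, 5)),
  spp1  = c('a', 'b', 'c', 'a', 'b', 'c', 'd', 'b', 'b', 'b', 'e', 'f')
)
data.2 <- data.frame(
  plot2 = c(rep(1, 2), rep(2, 3), rep(3, 5)),
  spp2  = c('a', 'a', 'b', 'c', 'c', 'b', 'b', 'b', 'e', 'f')
)

# Hypothetical helper: species found in plot p of data.1 but not in plot p of data.2
diff_for_plot <- function(p) {
  # setdiff() already drops duplicates, so unique() is not needed
  only_in_1 <- setdiff(data.1$spp1[data.1$plot1 == p],
                       data.2$spp2[data.2$plot2 == p])
  # setdiff() returns character(0) when nothing is left; substitute an NA placeholder
  if (length(only_in_1) == 0) only_in_1 <- NA_character_
  # the scalar p is recycled to the length of only_in_1
  data.frame(plot = p, spp_name = only_in_1)
}

# Build one small dataframe per plot, then stack them
df_result <- do.call(rbind, lapply(unique(data.1$plot1), diff_for_plot))
df_result
```

Building one dataframe per plot and binding them at the end avoids growing an empty dataframe inside a loop, and no inner loop over species is needed. On the example data this should give the same five rows as the anti_join() approach, with columns named plot and spp_name.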