Search code examples
rdataframematrix-multiplication

Creating a data frame by querying two other related data frames


I want to create a data frame that calculates abundances of traits from two other related data frames but don’t know what code would give me the desired output. The first data frame, species abundance, contains species (rows) by sites (columns). e.g.;

          BaCo_1 BaFa_1 BaSl_1 BrSl_1 CaCo_1 CaFa_1 CaSl_1
Abla.nota      0      0      3      2      9      0      0
Albo.woro      0      0      0      1      0      0      2
Aust.coll      0      2      0      0      2      0      0
Bibu.Kadj      0      0      1      0      0      0      0
cala.sp        0      0      0      0      0      7      1

The second dataframe, species traits, contains species (rows) by traits (columns);

          MFA2 MFA4 MFA5 MFA6 MFA12 MFA14 flow1 flow2 flow3
Abla.nota    1    0    0    0     0     0     1     0     0
Albo.woro    0    0    0    1     0     0     1     0     0
alel.aust    0    1    0    0     0     0     0     0     1
Aulo.stri    1    0    0    0     0     0     0     0     1
Aust.anac    0    0    0    0     1     0     1     0     0
Aust.coll    0    0    0    0     1     0     1     0     0
Aust.subt    0    1    0    0     0     0     0     0     1
bero.sp      0    1    0    0     0     0     1     0     0
Bibu.Kadj    1    0    0    0     0     0     0     0     1
Bran.sowe    0    0    1    0     0     0     1     0     0
cala.sp      0    0    0    1     0     0     1     0     0

Note that the “traits” data frame contains more species than the “sites” data frame. This is because the trait data frame contains all species in the study, of which the “sites” data frame is a subset. Both examples above are subsets of the full data.

From these, I want to create a third data frame, sites (rows) by traits (columns), a dataframe which has the trait abundances at each site, regardless of the species that possess those traits;

       MFA2 MFA4 MFA5 MFA6 MFA12 MFA14 flow1 flow2 flow3
BaCo_1    0    0    0    0     0     0     0     0     0
BaFa_1    0    0    0    0     2     0     2     0     0
BaSl_1    4    0    0    0     0     0     3     0     1
BrSl_1    2    0    0    1     0     0     3     0     0
CaCo_1    9    0    0    0     2     0    11     0     0
CaFa_1    0    0    0    7     0     0     7     0     0
CaSl_1    0    0    0    3     0     0     3     0     0

I created this output by hand because I don’t know what code would produce this output. Each value is a summation of the abundances of all individuals in the sample (row) that possess a trait (column). For example, CaCo_1 x flow1 in the example output, 11, above is the summation of the 9 alba.nota and 2 Aust.coll from the first data frame that possess the flow1 trait identified in the second dataframe.

What code and package should I use to produce this outcome?

sites_df <- structure(list(BaCo_1 = c(0, 0, 0, 0, 0), BaFa_1 = c(0, 0, 2, 
0, 0), BaSl_1 = c(3, 0, 0, 1, 0), BrSl_1 = c(2, 1, 0, 0, 0), 
    CaCo_1 = c(9, 0, 2, 0, 0), CaFa_1 = c(0, 0, 0, 0, 7), CaSl_1 = c(0, 
    2, 0, 0, 1)), row.names = c("Abla.nota", "Albo.woro", "Aust.coll", 
"Bibu.Kadj", "cala.sp"), class = "data.frame")

traits_df <- structure(list(MFA2 = c(1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0), MFA4 = c(0, 
0, 1, 0, 0, 0, 1, 1, 0, 0, 0), MFA5 = c(0, 0, 0, 0, 0, 0, 0, 
0, 0, 1, 0), MFA6 = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1), MFA12 = c(0, 
0, 0, 0, 1, 1, 0, 0, 0, 0, 0), MFA14 = c(0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0), flow1 = c(1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1), flow2 = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0), flow3 = c(0, 0, 1, 1, 0, 0, 1, 
0, 1, 0, 0)), row.names = c("Abla.nota", "Albo.woro", "alel.aust", 
"Aulo.stri", "Aust.anac", "Aust.coll", "Aust.subt", "bero.sp", 
"Bibu.Kadj", "Bran.sowe", "cala.sp"), class = "data.frame")

Solution

  • First, we transform the data.frames to tibbles and turn the row names into columns. Second we left_join() the two datasets and pivot_longer() the trait columns. We can now multiply the site columns with the trait values and finally summarise() the data by the site names with sum().

    library(tidyverse)
    
    traits_df2 <- 
      traits_df |> 
      as_tibble(rownames = "row_names") 
    
    sites_df2 <- 
      sites_df |> 
      as_tibble(rownames = "row_names")  
    
    left_join(sites_df2, traits_df2) |> 
      pivot_longer(-c(row_names, (MFA2:flow3)), names_to = "site") |> 
      mutate(across(MFA2:flow3, \(x) x * value)) |> 
      group_by(site) |> 
      summarise(across(MFA2:flow3, sum))
    #> Joining with `by = join_by(row_names)`
    #> # A tibble: 7 × 10
    #>   site    MFA2  MFA4  MFA5  MFA6 MFA12 MFA14 flow1 flow2 flow3
    #>   <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    #> 1 BaCo_1     0     0     0     0     0     0     0     0     0
    #> 2 BaFa_1     0     0     0     0     2     0     2     0     0
    #> 3 BaSl_1     4     0     0     0     0     0     3     0     1
    #> 4 BrSl_1     2     0     0     1     0     0     3     0     0
    #> 5 CaCo_1     9     0     0     0     2     0    11     0     0
    #> 6 CaFa_1     0     0     0     7     0     0     7     0     0
    #> 7 CaSl_1     0     0     0     3     0     0     3     0     0