Calculate relative abundance by row label in R? (vegan package?)


I'm trying to calculate relative abundances based on row labels or names, i.e. get the relative abundance for each test in df$path1. So I'd like to calculate the relative abundance of counts from test1, and separately calculate the relative abundance of counts from test2. The relative abundances within test1 would then sum to 1.

I'm currently using the vegan package, but open to other options.

Test dataset:

library(vegan)
df <- data.frame(x = c("a", "b", "c", "d", "e"), 
                 path1 = c("test1", "test1", "test2", "test2", "test3"),
                 value = c(40, 10, 34, 12, 20))
df$relabun <- decostand(df[3], 2, method = "total") # takes relative abundance of the whole column

The ideal output, with relative abundance calculated within each df$path1 group, would look like this:

x path1 relabun_bypath1
a test1 0.8
b test1 0.2
c test2 0.74
d test2 0.26
e test3 1

Solution

  • This is a classic split–apply–combine question. The most literal way in base R is to

    • split the data.frame by group with split,
    • apply a function with *apply, and
    • combine with do.call(rbind, ... ) or unlist.

    so

    unlist(lapply(split(df, df$path1), function(x){x$value / sum(x$value)}))
    #    test11    test12    test21    test22     test3 
    # 0.8000000 0.2000000 0.7391304 0.2608696 1.0000000 
    

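    which we can assign to a new variable (provided the rows are already sorted by group, as they are here, because split returns the pieces in group order rather than in the original row order). However, base R has a nice if oddly named function called ave, which can apply a function across groups for us: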

    ave(df$value, df$path1, FUN = function(x){x / sum(x)})
    # [1] 0.8000000 0.2000000 0.7391304 0.2608696 1.0000000
    

    which is a good deal more concise, and can likewise be assigned to a new variable.
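
    As a minimal sketch of that assignment step, the following rebuilds the example data from the question, stores the ave result in a new column, and checks that the proportions sum to 1 within each group. The column name relabun_bypath1 is taken from the question's ideal output; group_totals is just an illustrative helper name.

    ```r
    # Rebuild the example data from the question
    df <- data.frame(x = c("a", "b", "c", "d", "e"),
                     path1 = c("test1", "test1", "test2", "test2", "test3"),
                     value = c(40, 10, 34, 12, 20))

    # ave() returns one value per row, in the original row order,
    # so the result can be assigned straight to a new column
    df$relabun_bypath1 <- ave(df$value, df$path1, FUN = function(x) x / sum(x))

    # Sanity check: the proportions within each group should sum to 1
    group_totals <- tapply(df$relabun_bypath1, df$path1, sum)
    print(group_totals)
    ```

    Because ave keeps the row order, this should stay correct even if the rows are not sorted by path1.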

    If you prefer the Hadleyverse, dplyr's grouping can make the process more readable:

    library(dplyr)
    df %>% group_by(path1) %>% mutate(relAbundByPath = value / sum(value))
    # Source: local data frame [5 x 4]
    # Groups: path1 [3]
    # 
    #        x  path1 value relAbundByPath
    #   (fctr) (fctr) (dbl)          (dbl)
    # 1      a  test1    40      0.8000000
    # 2      b  test1    10      0.2000000
    # 3      c  test2    34      0.7391304
    # 4      d  test2    12      0.2608696
    # 5      e  test3    20      1.0000000
    

    As you can see, it returns a new version of the data.frame, which we can use to overwrite the existing one or make a new copy.

    Whichever route you choose, get comfortable with the logic, because you'll likely use it a lot. Better, learn all of them. And tapply and mapply/Map. And data.table...why not?
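
    For comparison, here is a rough sketch of the same calculation with two of the other base R tools: tapply, and split paired with unsplit. This is only one way to wire them together, and the variable names (totals, rel_tapply, pieces, rel_unsplit) are made up for illustration.

    ```r
    # Rebuild the example data from the question
    df <- data.frame(x = c("a", "b", "c", "d", "e"),
                     path1 = c("test1", "test1", "test2", "test2", "test3"),
                     value = c(40, 10, 34, 12, 20))

    # tapply() gives one total per group, named by group label
    totals <- tapply(df$value, df$path1, sum)

    # Look up each row's group total by name, then divide
    rel_tapply <- df$value / as.vector(totals[as.character(df$path1)])

    # split() + lapply() + unsplit(): unsplit() puts the pieces back
    # in the original row order, even if rows are not sorted by group
    pieces <- lapply(split(df$value, df$path1), function(x) x / sum(x))
    rel_unsplit <- unsplit(pieces, df$path1)
    ```

    Both vectors line up with the rows of df, so either can be assigned to a new column.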


    Note: You can also replace the value / sum(value) construct with the prop.table function if you like. It's more concise (e.g. ave(df$value, df$path1, FUN = prop.table)), but it's less obvious what it's doing, which is why I didn't use it here.
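
    To see that the two spellings agree on the example data, a quick check along these lines should do (by_hand, by_prop and same are illustrative names only):

    ```r
    # Rebuild the example data from the question
    df <- data.frame(x = c("a", "b", "c", "d", "e"),
                     path1 = c("test1", "test1", "test2", "test2", "test3"),
                     value = c(40, 10, 34, 12, 20))

    # Explicit version: divide each value by its group total
    by_hand <- ave(df$value, df$path1, FUN = function(x) x / sum(x))

    # prop.table() on a plain vector computes x / sum(x)
    by_prop <- ave(df$value, df$path1, FUN = prop.table)

    # Compare the two results, allowing for floating-point tolerance
    same <- isTRUE(all.equal(by_hand, by_prop))
    print(same)
    ```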