Search code examples
rvariablesgini

Blau index of Diversity in R


I am trying to calculate the Blau index of diversity (gini-simpson) in R on my data frame. I have 6 columns for each person in a group, with values ranging from "Student", "Faculty", "Alumni" "Not Applicable". There are also NA's within the columns if a group is smaller than 6.

I would like to calculate the Blau index across the rows (the diversity across the entire group) not within each column, with na.rm= TRUE.

Does anyone know how to do this in R?

Thanks so much!

See here for a picture of data frame


Solution

  • We can calculate the Gini-Simpson index quite easily by hand.

    First off, I'll generate some sample data:

    # Generate sample data
    set.seed(2017);
    type <- c("Student", "Faculty", "Alumni");
    data <- sample(type, 6 * 20, replace = TRUE);
    
    # Replace 40 entries with NAs
    set.seed(2017);
    data[sample(6 * 20, 40)] <- NA;
    
    # Reformat as 6 column dataframe
    df <- as.data.frame(matrix(data, ncol = 6), stringsAsFactors = FALSE);
    names(df) <- paste0("e", seq(1:6), "_affiliation");
    head(df);
    #e1_affiliation e2_affiliation e3_affiliation e4_affiliation e5_affiliation
    #1           <NA>        Faculty           <NA>        Student        Student
    #2           <NA>           <NA>           <NA>        Faculty         Alumni
    #3           <NA>         Alumni        Student        Faculty        Faculty
    #4        Student           <NA>           <NA>           <NA>           <NA>
    #5           <NA>        Student         Alumni         Alumni        Student
    #6         Alumni         Alumni        Faculty        Faculty        Student
    # e6_affiliation
    #1         Alumni
    #2         Alumni
    #3           <NA>
    #4        Student
    #5        Faculty
    #6        Student
    

    The Gini-Simpson (= Gibbs-Martin = Blau) index of diversity is given by

    enter image description here

    where R denotes the total number of types, and enter image description here is the proportional abundance of the ith type.

    We define a function that takes a vector of strings and returns the GS index:

    # Define function to calculate the Gini-Simpson index
    # We ensure the same levels (present or absent) of x
    # by factor(x, levels = type)
    # Note that NAs will not be considered by default
    get.GS.index <- function(x, type) {
        x <- factor(x, levels = type);
        return(1 - sum(prop.table(table(x))^2));
    }
    

    We can now apply get.GS.index to all rows of the dataframe

    apply(df, 1, get.GS.index, type)
    #[1] 0.6250000 0.4444444 0.6250000 0.0000000 0.6400000 0.6666667 0.5000000
    #[8] 0.6250000 0.6400000 0.5000000 0.4444444 0.6400000 0.3750000 0.3750000
    #[15] 0.0000000 0.0000000 0.6111111 0.4444444 0.6666667 0.6400000
    

    Update

    We can modify the function get.GS.index to return NA if there is only one type present in a group.

    get.GS.index <- function(x, type) {
        x <- factor(x, levels = type);
        t <- table(x);
        if (length(t[t>0]) == 1) return(NA) else return(1 - sum(prop.table(t)^2));
    }
    
    apply(df, 1, get.GS.index, type);
    # [1] 0.6250000 0.4444444 0.6250000        NA 0.6400000 0.6666667 0.5000000
    # [8] 0.6250000 0.6400000 0.5000000 0.4444444 0.6400000 0.3750000 0.3750000
    #[15]        NA        NA 0.6111111 0.4444444 0.6666667 0.6400000