
R: apply function to subsets based on column value


I have a data frame called income.df that looks something like this:

ID region income
1 rot 3700
2 ams 2500
3 utr 3300
4 utr 5300
5 utr 4400
6 ams 3100
8 ams 3000
9 rot 4000
10 rot 4400
12 rot 2000

I want to use the Gini function from the DescTools package to compute the Gini coefficient for each region. To compute it for the whole data frame, without taking region into account, I would do the following:

library(DescTools)
Gini(income.df$income, n = rep(1, length(income.df$income)), unbiased = TRUE, conf.level = NA, R = 1000, type = "bca", na.rm = TRUE)

Is there a way to do this for each region within the data frame, so in this case for "rot", "utr", and "ams"? Note that the Gini function also takes the length of the vector (4, 3, and 3 for the three regions, respectively). I suspect something like lapply could do this, but I couldn't figure out how to pass those lengths automatically within the function (my actual data frame is a lot larger, so doing it by hand is not an option).


Solution

    Using base R:

    library(DescTools)
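    # Split the data frame by region, then compute the Gini coefficient of each piece.
    # length(x$income) is evaluated per subset, so no lengths are passed by hand.
    lapply(split(df, df$region),
           function(x) Gini(x$income, n = rep(1, length(x$income)), unbiased = TRUE,
                            conf.level = NA, R = 1000, type = "bca", na.rm = TRUE))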
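
To see what the per-region call computes, here is a minimal base R sketch that runs without DescTools. `gini_unbiased` is a hypothetical helper written for illustration, not the package's implementation. It assumes equal weights and an n/(n - 1) small-sample correction, which matches the coefficients reported in this answer for the sample data. The group length is taken inside the function, so nothing has to be supplied per region.

```r
# Hypothetical helper (not part of DescTools): Gini coefficient from the
# mean absolute difference over all pairs, with an n/(n - 1) correction.
gini_unbiased <- function(x) {
  x <- x[!is.na(x)]                      # drop missing values
  n <- length(x)                         # group size is derived here
  g <- sum(abs(outer(x, x, "-"))) / (2 * n^2 * mean(x))
  g * n / (n - 1)
}

# Sample data from the question
income.df <- read.table(text = "
ID region income
1 rot 3700
2 ams 2500
3 utr 3300
4 utr 5300
5 utr 4400
6 ams 3100
8 ams 3000
9 rot 4000
10 rot 4400
12 rot 2000
", header = TRUE)

# One coefficient per region, returned as a named array
gini_by_region <- tapply(income.df$income, income.df$region, gini_unbiased)
print(round(gini_by_region, 4))
```

`tapply` returns one value per region name, which is handy when a compact summary is enough and the list returned by `lapply` is more than you need.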
    

    Using tidyverse:

    library(tidyverse)
    library(DescTools)
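    df %>% group_by(region) %>% nest() %>%
           # Compute one Gini coefficient per nested region subset
           mutate(gini_coef = map(data, ~Gini(.x$income, n = rep(1, length(.x$income)),
                  unbiased = TRUE, conf.level = NA, R = 1000, type = "bca", na.rm = TRUE))) %>%
           # Drop the nested data, flatten the list column, and join back to the rows.
           # Naming the column avoids the warning newer tidyr gives for a bare unnest().
           select(-data) %>% unnest(gini_coef) %>% left_join(df)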
    
    
    Joining, by = "region"
    # A tibble: 10 x 4
       region gini_coef    ID income
       <fct>      <dbl> <int>  <int>
     1 rot       0.177      1   3700
     2 rot       0.177      9   4000
     3 rot       0.177     10   4400
     4 rot       0.177     12   2000
     5 ams       0.0698     2   2500
     6 ams       0.0698     6   3100
     7 ams       0.0698     8   3000
     8 utr       0.154      3   3300
     9 utr       0.154      4   5300
    10 utr       0.154      5   4400
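
For comparison, the same row-level shape (every row carrying its region's coefficient) can be sketched in base R with `ave()`, without nesting or joining. As an assumption for illustration, `gini_unbiased` is a hypothetical stand-in for `Gini` with equal weights, used only so the snippet runs without extra packages.

```r
# Hypothetical stand-in for DescTools::Gini with equal weights:
# pairwise mean absolute difference with an n/(n - 1) correction.
gini_unbiased <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  sum(abs(outer(x, x, "-"))) / (2 * n^2 * mean(x)) * n / (n - 1)
}

df <- read.table(text = "
ID region income
1 rot 3700
2 ams 2500
3 utr 3300
4 utr 5300
5 utr 4400
6 ams 3100
8 ams 3000
9 rot 4000
10 rot 4400
12 rot 2000
", header = TRUE)

# ave() applies FUN within each region and repeats the single result
# for every row of that region, keeping the original row order.
df$gini_coef <- ave(df$income, df$region, FUN = gini_unbiased)
print(df)
```

Unlike the nested version, this keeps the rows in their original order, since no join is involved.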
    

    Data

    df <- read.table(text = "
    ID region income
    1 rot 3700
    2 ams 2500
    3 utr 3300
    4 utr 5300
    5 utr 4400
    6 ams 3100
    8 ams 3000
    9 rot 4000
    10 rot 4400
    12 rot 2000
    ", header = TRUE)