
Testing over a vector of variables and summing over a table, creating new columns in R


I have a table like this:

df <- read.table(text = 
                "  Day      city    gender     week
                 'day1'    'city1'   'M'       'one'
                 'day2'    'city2'   'M'       'two'
                 'day1'    'city3'   'F'       'two'
                 'day2'    'city4'   'F'       'two'", 
                 header = TRUE, stringsAsFactors = FALSE) 

I'm computing a summary table like this:

library(data.table)

daily_table <- setDT(df)[, .(Daily_Freq = .N,
                             men = sum(gender == 'M'),
                             women = sum(gender == 'F'),
                             city1 = sum(city == 'city1'),
                             city2 = sum(city == 'city2'),
                             city3 = sum(city == 'city3'),
                             city4 = sum(city == 'city4'),
                             city5 = sum(city == 'city5'))
                         , by = .(week,Day)]

which produces this table:

   week  Day Daily_Freq men women city1 city2 city3 city4 city5
    one day1          1   1     0     1     0     0     0     0
    two day2          2   1     1     0     1     0     1     0
    two day1          1   0     1     0     0     1     0     0

But because I have several cities, I would like to use a vector with their names:

cities <- c("city1","city2","city3","city4","city5")

Note that my vector has 5 cities. Even though one of them has zero occurrences, I want it to appear in my final table. How can I do that?


Solution

  • To make sure city5 appears even though no observations have that value, convert city to a factor that includes it as a level:

    setDT(df)
    
    df[, city :=  factor(city,
                         levels = c("city1","city2","city3","city4","city5"))]
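
    The reason this helps is that a factor keeps every declared level, including levels that never occur in the data. A small base R sketch of that behaviour (the `observed` vector is only an illustration and is not part of the question's data frame):

    ```r
    # Hypothetical stand-in for the city column in the question
    observed <- c("city1", "city2", "city3", "city4")
    cities   <- c("city1", "city2", "city3", "city4", "city5")

    # Counting a plain character vector only reports values that occur
    counts_chr <- table(observed)

    # Counting a factor reports every declared level, so city5 shows up as 0
    counts_fct <- table(factor(observed, levels = cities))

    names(counts_chr)
    names(counts_fct)
    ```

    The second result includes city5, which is why the summary can report a zero for it.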
    

    To avoid writing out a separate test for each level of city, you can iterate over the levels, like this:

    daily_table <- df[, c(.(Daily_Freq = .N,
                            men = sum(gender == 'M'),
                            women = sum(gender == 'F')),
                          lapply(setNames(levels(city), levels(city)),
                                 function(x) sum(city == x))),
                      by = .(week,Day)]
    daily_table
    ##    week  Day Daily_Freq men women city1 city2 city3 city4 city5
    ## 1:  one day1          1   1     0     1     0     0     0     0
    ## 2:  two day2          2   1     1     0     1     0     1     0
    ## 3:  two day1          1   0     1     0     0     1     0     0
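
    If you would rather not convert city to a factor, the same pattern should also work with the cities vector from the question directly. A minimal sketch, assuming data.table is installed; it rebuilds the sample data so that the block runs on its own, and it uses list() in place of .() purely as a stylistic choice:

    ```r
    library(data.table)

    # Sample data from the question, rebuilt as a data.table
    df <- data.table(
      Day    = c("day1", "day2", "day1", "day2"),
      city   = c("city1", "city2", "city3", "city4"),
      gender = c("M", "M", "F", "F"),
      week   = c("one", "two", "two", "two")
    )

    # Vector of city names from the question; city5 never occurs in df
    cities <- c("city1", "city2", "city3", "city4", "city5")

    daily_table <- df[, c(list(Daily_Freq = .N,
                               men   = sum(gender == "M"),
                               women = sum(gender == "F")),
                          # One named count column per entry in cities
                          lapply(setNames(cities, cities),
                                 function(x) sum(city == x))),
                      by = .(week, Day)]
    daily_table
    ```

    Because the columns come from cities rather than from the data, city5 gets a column of zeros, and adding a city only requires extending the vector.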