Tags: r, dplyr, conditional-statements, summarize

Summarize using condition for a single column


Sample data:

df <- data.frame(HELP = c("Yes", "Yes", "Yes", "No", "Yes", "No"))

I did:

cdata <- ddply(df, c("HELP"), summarise,
           Total = sum(df$HELP == 'No'),
           Probability = Total/nrow(df))

but the "Yes" row ends up with the same value as the "No" row. I tried using an "if" condition, but it didn't work.

What I want is to summarize by HELP, giving the count of rows where df$HELP == "No" and the count where df$HELP == "Yes", along with their respective probabilities.

The end result should look something like this:

|    | Help | Total | Probability  |
|----|------|-------|--------------|
|  1 | Yes  | 4     | 0.666        |
|  2 | No   | 2     | 0.333        |

What is the appropriate way to do this, with ddply or some other way?

Regards


Solution

  • I suggest using dplyr, as you tagged. It lets you group your data easily with group_by, and with summarise and mutate you can add new columns to get your desired result.

    > library(dplyr)
    > df %>% group_by(HELP) %>% summarise(Total = n()) %>% mutate(Probability = Total / sum(Total))
    # A tibble: 2 x 3
        HELP Total Probability
      <fctr> <int>       <dbl>
    1     No     2   0.3333333
    2    Yes     4   0.6666667
    

    Explanation

    %>% forwards the output of the command on its left to the command on its right. You can chain several commands one after another, but while that works, a long chain can quickly become a mess to read.

    group_by(HELP) will divide your data frame into groups of rows with identical values in HELP. It can also take several columns.

    summarise(Total = n()) -- n() is another dplyr function that returns the number of rows in the current group. In both summarise and mutate, new column names are given without quotes (' or ").

    mutate(Probability = Total / sum(Total)) -- a simple calculation based on the Total column computed in the previous step. Because summarise has already collapsed the data to one row per group, sum(Total) here is the total number of rows in the original data frame.
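
Since the question also asks about ddply: the original attempt gives the same Total for both groups because df$HELP inside the call refers to the whole data frame, not to the rows of the current group, so sum(df$HELP == 'No') is 2 every time. Below is a minimal sketch of one way to get the same result with plyr, assuming the plyr package is installed. The names n_total and piece are made up for illustration, and this is only one possible approach, not necessarily the best one.

```r
# Sample data from the question
df <- data.frame(HELP = c("Yes", "Yes", "Yes", "No", "Yes", "No"))

# Total number of rows, computed once outside the grouped call
# (n_total is a hypothetical helper name)
n_total <- nrow(df)

# ddply() splits df by HELP and calls the function once per group;
# `piece` (hypothetical argument name) holds only that group's rows,
# so nrow(piece) is the per-group count rather than a whole-table count
cdata <- plyr::ddply(df, "HELP", function(piece) {
  data.frame(Total = nrow(piece),
             Probability = nrow(piece) / n_total)
})

print(cdata)
```

Calling plyr::ddply with the namespace prefix, rather than library(plyr), avoids masking dplyr's summarise and mutate if both packages are used in the same session.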