Sample data:
df <- data.frame(HELP = c("Yes", "Yes", "Yes", "No", "Yes", "No"))
I did:
cdata <- ddply(df, c("HELP"), summarise,
Total = sum(df$HELP == 'No'),
Probability = Total/nrow(df))
but the row for "Yes" ends up with the same value as the row for "No". I tried using an "if" condition, but it didn't work.
What I want is to summarise by HELP, getting the number of rows where `df$HELP == "No"` and the number where `df$HELP == "Yes"`, along with their respective probabilities.
The end result should look something like this:
| | Help | Total | Probability |
|----|------|-------|--------------|
| 1 | Yes | 4 | 0.666 |
| 2 | No | 2 | 0.333 |
What is the appropriate way to do this, with ddply or some other way?
Regards
I suggest using `dplyr`, as you tagged. It lets you group your data easily with `group_by`, and with `summarise` and `mutate` you can add new columns to get your desired result.
> library(dplyr)
> df %>% group_by(HELP) %>% summarise(Total = n()) %>% mutate(Probability = Total / sum(Total))
# A tibble: 2 x 3
HELP Total Probability
<fctr> <int> <dbl>
1 No 2 0.3333333
2 Yes 4 0.6666667
`%>%` forwards the output of the command on its left to the command on its right, as that command's first argument. You can chain several commands one after another, but long chains can quickly become hard to read.
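To make the pipe less abstract, here is a small sketch comparing the piped call with the same calls written nested. It assumes `dplyr` is installed and reuses the sample data from the question; the names `piped` and `nested` are only for illustration.

```r
library(dplyr)

df <- data.frame(HELP = c("Yes", "Yes", "Yes", "No", "Yes", "No"))

# Piped form: each step's result becomes the first argument of the next call.
piped <- df %>%
  group_by(HELP) %>%
  summarise(Total = n()) %>%
  mutate(Probability = Total / sum(Total))

# Equivalent nested form: the same calls, read from the inside out.
nested <- mutate(
  summarise(group_by(df, HELP), Total = n()),
  Probability = Total / sum(Total)
)

# Both forms should produce the same result.
identical(piped, nested)
```

The nested version has to be read from the innermost call outwards, which is the main reason the piped form is usually preferred.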
`group_by(HELP)` splits your data frame into groups of rows that share the same value in `HELP`. It can also take several columns.
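As a rough illustration of grouping by several columns, the sketch below adds a second column. `GROUP` and `df2` are hypothetical and not part of the question's data, and the `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical data: GROUP is an invented second column, not in the question.
df2 <- data.frame(
  GROUP = c("A", "A", "A", "B", "B", "B"),
  HELP  = c("Yes", "Yes", "No", "Yes", "No", "No")
)

# One output row per GROUP/HELP combination that occurs in the data.
counts <- df2 %>%
  group_by(GROUP, HELP) %>%
  summarise(Total = n(), .groups = "drop_last") %>%
  mutate(Probability = Total / sum(Total))

# The result is still grouped by GROUP after summarise(),
# so sum(Total) is taken within each GROUP, not over all rows.
counts
```

With two grouping columns the probabilities therefore sum to 1 within each `GROUP`. If you want shares of the overall total instead, add `ungroup()` before `mutate`.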
`summarise(Total = n())` -- `n()` is another `dplyr` function; it returns the number of rows in the current group. In both `summarise` and `mutate`, new column names are given without quotes (`'` or `"`).
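Since counting rows per group is so common, `dplyr` also has `count()`, which combines the `group_by` and `summarise(n())` steps. A minimal sketch, assuming a dplyr version in which `count()` accepts the `name` argument; `result` is just an illustrative name:

```r
library(dplyr)

df <- data.frame(HELP = c("Yes", "Yes", "Yes", "No", "Yes", "No"))

# count() groups by HELP and counts the rows in each group;
# name = "Total" sets the name of the count column (the default is "n").
result <- df %>%
  count(HELP, name = "Total") %>%
  mutate(Probability = Total / sum(Total))

result
```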
`mutate(Probability = Total / sum(Total))` -- a simple calculation based on the totals from the previous step. Because `summarise` drops the grouping on `HELP` here, `sum(Total)` adds up the counts of all groups.
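If you would rather stay with `ddply`, the attempt from the question can probably be repaired along these lines. The original call gave both groups the same total because `df$HELP` refers to the whole data frame instead of the current group's piece. This is only a sketch, assuming `plyr` is installed; it calls `plyr::` explicitly so it does not mask `dplyr`'s `summarise`, and the base R lines at the end are an optional cross-check.

```r
df <- data.frame(HELP = c("Yes", "Yes", "Yes", "No", "Yes", "No"))

# Inside summarise, the bare name HELP is the current group's column,
# so length(HELP) counts only the rows of that group.
cdata <- plyr::ddply(df, "HELP", plyr::summarise, Total = length(HELP))

# Divide by the grand total once the per-group counts exist.
cdata$Probability <- cdata$Total / sum(cdata$Total)

cdata

# Base R cross-check, no extra packages needed.
base_counts <- table(df$HELP)
base_probs  <- prop.table(base_counts)
```

Computing `Probability` outside the `ddply` call keeps the per-group step free of references to the full data frame, which is what went wrong in the first place.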