
Survival probability based on continuous variable in R | Titanic dataset


Below is the Titanic dataset, for which I am trying to find the conditional probability of survival given sex and fare. Sex is a categorical variable and fare is a continuous variable.

library(PASWR2)
library(magrittr)
library(data.table)

# convert dataset from data frame to data table 
titanic3 <- copy(TITANIC3)
setDT(titanic3)

The following statement computes the survival probability for each exact value of fare; however, I want to base it on the distribution of the fare column.

titanic3[, survival_prob := round(100 * mean(survived), 1), by = .(fare, sex)]

I have tried converting fare from a continuous to a categorical variable and then calculating the probability. The results were somewhat accurate; however, the probabilities change substantially depending on the size of the bins I create for the categorical variable.
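A minimal sketch of such a binning attempt, assuming quantile-based bins. The helper survival_by_bin() is hypothetical and only illustrates how the estimates depend on the number of bins chosen:

```r
library(PASWR2)
library(data.table)

titanic3 <- copy(TITANIC3)
setDT(titanic3)

# Hypothetical helper: survival rate (%) by sex and fare bin,
# using n_bins quantile-based bins (an assumed binning scheme)
survival_by_bin <- function(dt, n_bins) {
  breaks <- unique(quantile(dt$fare,
                            probs = seq(0, 1, length.out = n_bins + 1),
                            na.rm = TRUE))
  out <- dt[!is.na(fare)]  # subsetting returns a copy, so dt is left unchanged
  out[, fare_bin := cut(fare, breaks = breaks, include.lowest = TRUE)]
  out[, .(n = .N, survival_prob = round(100 * mean(survived), 1)),
      keyby = .(sex, fare_bin)]
}

survival_by_bin(titanic3, 4)   # coarse bins
survival_by_bin(titanic3, 10)  # finer bins, fewer passengers per group
```

Comparing the two tables shows the issue: the finer the bins, the fewer passengers fall into each group, and the group means become noisier.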

Is there a better way to do so?

Thanks.


Solution

  • You want the conditional probability of survival given sex and fare. However, fare is a continuous variable, so grouping by its exact values does not work well: most groups contain only a few passengers, and the group means are noisy. What you need instead is a statistical model.

    One option is logistic regression. First, estimate a model using logistic regression. Then extract the fitted values from the model object mdl; these correspond to the conditional probabilities you want. Note that logistic regression is only one of several statistical approaches to estimating conditional probabilities, although it is widely used for tasks like this one.

    library(PASWR2)
    library(magrittr)
    library(data.table)
    
    
    titanic3 <- copy(TITANIC3)
    setDT(titanic3)
    
    
    # use logistic regression to estimate the conditional probability to survive
    # based on fare and sex
    mdl <- glm(survived ~ fare + sex, family = binomial(), data = titanic3)
    
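    # extract the fitted values, which correspond to the conditional probabilities
    # (rows with a missing fare are dropped by glm(), so there may be fewer
    # fitted values than rows in titanic3)
    mdl$fitted.values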
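
As a follow-up sketch, the fitted probabilities can be attached to the data and the model can be evaluated at arbitrary fares. This assumes na.action = na.exclude, so that fitted() returns one value per row; the column name survival_prob and the example fares are illustrative choices, not part of the original answer:

```r
library(PASWR2)
library(data.table)

titanic3 <- copy(TITANIC3)
setDT(titanic3)

# na.exclude keeps rows with a missing fare out of the fit, but lets
# fitted() return NA for them, so the result has one value per row
mdl <- glm(survived ~ fare + sex, family = binomial(),
           data = titanic3, na.action = na.exclude)

# attach the estimated survival probability (in percent) to every passenger
titanic3[, survival_prob := round(100 * fitted(mdl), 1)]

# estimated probability for chosen fares (illustrative values) and both sexes
new_passengers <- CJ(fare = c(10, 50, 100), sex = c("female", "male"))
new_passengers[, survival_prob := round(
  100 * predict(mdl, newdata = new_passengers, type = "response"), 1)]
new_passengers
```

Unlike the binned estimates, these probabilities vary smoothly with fare and do not depend on a choice of bin width, at the cost of assuming that the log-odds of survival are linear in fare.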