r dplyr statistics tidyverse probability

How to calculate probability in R


Hi, I'm taking a stats class, and we were given the "NHANES" dataset, which we filtered down to adults with a non-missing SmokeNow value and saved as "NHANES_adult".

library(NHANES)
library(dplyr)  # provides %>%, distinct(), and filter()
# create a NHANES dataset without duplicated IDs 
NHANES <-
  NHANES %>%
  distinct(ID, .keep_all = TRUE) 

NHANES_adult <- NHANES %>%
  filter(Age >= 18) %>%  # only include individuals 18 or older
  filter(!is.na(SmokeNow))  # drop any observations with NA for SmokeNow

My prof asked the following:

1b. Now let's take a single sample of 100 individuals from the NHANES_adult dataframe, and compute the proportion of smokers, saving it to a variable called p_smokers.

set.seed(12345)  # PROVIDED CODE - this will cause it to create the same
                 # random sample each time

sample_size = 100 # size of each sample

p_smokers <- NHANES_adult %>%
  sample(sample_size) %>%  # take a sample from the data frame [I think this is okay]
  ____(____ = ____(____)) %>% # compute the probability of smoking [this is where I'm stuck: I can't work out which one-line function fits these blanks]
  ____()  # extract the variable from the data frame [I believe this is the mutate() function?]

p_smokers

Solution

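  • Maybe this is what you are looking for. You should use sample_n() rather than sample(): base R's sample() works on vectors and, given a data frame, samples its columns rather than its rows, whereas sample_n() draws rows. To find the proportion in one line, use mean() on the logical test SmokeNow == "Yes"; TRUE counts as 1 and FALSE as 0, so the mean is the proportion of smokers. summarize() fills the first blank and pull() the last one.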

    sample_size <- 100
    
    NHANES_adult %>%
      sample_n(sample_size) %>%  
      summarize(p_smok = mean(SmokeNow == "Yes")) %>% 
      pull(p_smok)
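
If you want to try the same pipeline without the NHANES package, here is a minimal sketch on a made-up data frame. The name toy_adults and its 30% smoker share are assumptions for illustration only, not part of the assignment. It uses slice_sample(n = ...), which supersedes sample_n() in dplyr 1.0.0 and later; sample_n() still works if you have an older version.

```r
library(dplyr)

# Hypothetical stand-in for NHANES_adult: 1,000 rows, 300 of them smokers
toy_adults <- data.frame(
  ID = 1:1000,
  SmokeNow = factor(rep(c("Yes", "No"), times = c(300, 700)))
)

set.seed(12345)     # same random sample each time
sample_size <- 100  # size of the sample

p_smokers <- toy_adults %>%
  slice_sample(n = sample_size) %>%                 # draw 100 rows at random
  summarize(p_smok = mean(SmokeNow == "Yes")) %>%   # TRUE counts as 1, FALSE as 0
  pull(p_smok)                                      # extract the value as a plain number

p_smokers

# For comparison: base R's sample() treats a data frame as a list of columns,
# so this returns one randomly chosen column with all 1,000 rows
dim(sample(toy_adults, 1))
```

The result is a single number between 0 and 1, which is why pull() is used at the end rather than mutate(): mutate() would add a column to the data frame instead of extracting a value from it.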