
Probability of a column between a range for a Normal Distribution


I am trying to create a new column, say duration_probability, that holds the probability of a value falling between 6 and 12 hours: P(6 < Origin_Duration ≤ 12).

 dput(df)
structure(list(CRD_NUM = c(1000120005478330, 1000130009109199, 
1000140001635234, 1000140002374747, 1000140003618308, 1000140007236959, 
1000140015078086, 1000140026268650, 1000140027281272, 1000148000012215
), Origin_Duration = c("10:48:38", "07:41:34", "11:16:41", "09:19:35", 
"17:09:19", "08:59:05", "11:27:28", "12:17:41", "10:45:42", "12:19:05"
)), .Names = c("CRD_NUM", "Origin_Duration"), class = c("data.table", 
"data.frame"), row.names = c(NA, -10L))

            CRD_NUM Origin_Duration
 1: 1000120005478330        10:48:38
 2: 1000130009109199        07:41:34
 3: 1000140001635234        11:16:41
 4: 1000140002374747        09:19:35
 5: 1000140003618308        17:09:19
 6: 1000140007236959        08:59:05
 7: 1000140015078086        11:27:28
 8: 1000140026268650        12:17:41
 9: 1000140027281272        10:45:42
10: 1000148000012215        12:19:05

I am not sure how to do that in R. I am trying to use the cumulative distribution function of the standard normal distribution to get the probability that a commuter's stay duration at a certain station falls between 6 and 12 hours. The output would be, for example, 0.96 for the duration 11:16:41.

My calculation would be something like: P(6 < X ≤ 12) = Φ((12 − μ)/σ) − Φ((6 − μ)/σ)


Solution

  • From your question it is unclear whether you already know the mean and variance or not. I will discuss both cases. Also, I will assume you have reason to believe that the durations are in fact normally distributed.

    Known parameters: Suppose you are given a pre-specified mean and standard deviation, say mu = 11 and sigma = 3 (in hours). Then you can use the fact that P(6 < X ≤ 12) = P(X ≤ 12) - P(X ≤ 6). The base R function pnorm() calculates this:

    mu    <- 11
    sigma <- 3
    pnorm(12, mu, sigma) - pnorm(6, mu, sigma)
    # 0.5827683
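
    The pnorm() difference is the same quantity as the standardised form Φ((12 − μ)/σ) − Φ((6 − μ)/σ) from the question. A minimal sketch that checks this, where prob_between is a hypothetical helper name and mu = 11, sigma = 3 are only the illustrative values used above:

    ```r
    # Hypothetical helper: P(lower < X <= upper) for X ~ Normal(mu, sigma)
    prob_between <- function(lower, upper, mu, sigma) {
      pnorm(upper, mean = mu, sd = sigma) - pnorm(lower, mean = mu, sd = sigma)
    }

    mu    <- 11  # illustrative mean, in hours
    sigma <- 3   # illustrative standard deviation, in hours

    p_direct <- prob_between(6, 12, mu, sigma)

    # Same probability via the standard normal CDF (pnorm with default mean 0, sd 1)
    p_standardised <- pnorm((12 - mu) / sigma) - pnorm((6 - mu) / sigma)

    print(p_direct)                             # 0.5827683
    print(all.equal(p_direct, p_standardised))  # TRUE
    ```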
    

    Unknown parameters, P(6 < X ≤ 12): If you do not yet know the mean and variance, you can estimate them from your data and use Student's t-distribution instead of the normal distribution (the story of why it is called the 'Student' distribution is a nice one too; see the Wikipedia article on Student's t-distribution). To find the mean and variance, it makes sense to first convert df$Origin_Duration from character to a time type:

    df$Origin_Duration <- as.POSIXct(df$Origin_Duration, format = "%H:%M:%S")
    
    mu          <- mean(df$Origin_Duration)       # "2017-09-04 11:12:28 CEST"
    df$demeaned <- df$Origin_Duration - mu
    sigma       <- var(df$demeaned)^0.5           # 153.68 
    

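    Note that I subtracted the mean before calculating the variance. I did this so that the values become time differences, which here are in minutes (R chooses the units of a difftime automatically, so it is worth checking them with units(df$demeaned)). The standard deviation should therefore be read as 153.68 minutes.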
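
    If you would rather not depend on the date that as.POSIXct() attaches or on automatically chosen difftime units, the durations can be converted to decimal hours directly. This is only a sketch of an alternative, not the approach used in the rest of this answer; hms_to_hours is a hypothetical helper name:

    ```r
    # Durations from the question, as "HH:MM:SS" strings
    origin_duration <- c("10:48:38", "07:41:34", "11:16:41", "09:19:35", "17:09:19",
                         "08:59:05", "11:27:28", "12:17:41", "10:45:42", "12:19:05")

    # Hypothetical helper: convert "HH:MM:SS" strings to decimal hours
    hms_to_hours <- function(x) {
      parts <- strsplit(x, ":", fixed = TRUE)
      vapply(parts,
             function(p) sum(as.numeric(p) * c(1, 1 / 60, 1 / 3600)),
             numeric(1))
    }

    duration_hours <- hms_to_hours(origin_duration)

    mu_hours    <- mean(duration_hours)  # about 11.21 hours, i.e. 11:12:28
    sigma_hours <- sd(duration_hours)    # about 2.56 hours, i.e. roughly 153.7 minutes
    ```

    Because everything is a plain number of hours, the bounds 6 and 12 can later be standardised as (6 - mu_hours) / sigma_hours and (12 - mu_hours) / sigma_hours without any date handling.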

    We will use the pt() function to calculate the probability P(X ≤ 12) - P(X ≤ 6). In order to do so, we need a standardised / scaled / normalised version of 12 and 6. That is, we have to subtract the mean and divide by the standard deviation:

    # The date must match the one as.POSIXct() attached above (the day the code was run)
    x6  <- as.numeric(difftime("2017-09-04 06:00:00", mu), units = "mins")/sigma
    x12 <- as.numeric(difftime("2017-09-04 12:00:00", mu), units = "mins")/sigma
    
    deg_fr <- length(df$demeaned)-1
    
    p_x_smaller_than12 <- pt( x12, df = deg_fr )    #  0.6178973
    p_x_smaller_than6  <- pt( x6,  df = deg_fr )    #  0.03627651
    p_x_smaller_than12 - p_x_smaller_than6
    # [1] 0.5816208
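
    The same calculation can be sketched end to end without hard-coding a date, by working in decimal hours. The helper name to_hours is hypothetical; the method (estimate the mean and standard deviation, standardise 6 and 12, take the difference of two pt() values with n - 1 degrees of freedom) is the one described above:

    ```r
    origin_duration <- c("10:48:38", "07:41:34", "11:16:41", "09:19:35", "17:09:19",
                         "08:59:05", "11:27:28", "12:17:41", "10:45:42", "12:19:05")

    # Hypothetical helper: convert "HH:MM:SS" strings to decimal hours
    to_hours <- function(x) {
      vapply(strsplit(x, ":", fixed = TRUE),
             function(p) sum(as.numeric(p) * c(1, 1 / 60, 1 / 3600)),
             numeric(1))
    }

    x      <- to_hours(origin_duration)
    mu     <- mean(x)        # estimated mean, in hours
    sigma  <- sd(x)          # estimated standard deviation, in hours
    deg_fr <- length(x) - 1  # degrees of freedom

    # Standardise the bounds, then take the difference of the t CDF values
    p_6_to_12 <- pt((12 - mu) / sigma, df = deg_fr) -
                 pt((6  - mu) / sigma, df = deg_fr)

    print(p_6_to_12)  # about 0.58, in line with the result above
    ```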
    

    Added in response to a comment: Unknown parameters, all entries. This gives P(X ≤ x) for each observed duration x:

    # scale gives the distance from the mean in terms of standard deviations:
    df$scaled <- scale(df$Origin_Duration)
    
    pt(df$scaled, df = deg_fr)
    # [1,] 0.4400575
    # [2,] 0.1015886
    # [3,] 0.5106114
    # [4,] 0.2406431
    # [5,] 0.9773264
    # [6,] 0.2039751
    # [7,] 0.5377728
    # [8,] 0.6593331
    # [9,] 0.4327620
    # [10,] 0.6625280
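
    To keep these values as an ordinary column next to the durations, something like the following could work. It is a sketch under assumptions: duration_cdf and to_hours are hypothetical names, a plain data.frame stands in for the data.table, and the column holds P(X ≤ x) for each row's duration, not P(6 < X ≤ 12):

    ```r
    df <- data.frame(
      Origin_Duration = c("10:48:38", "07:41:34", "11:16:41", "09:19:35", "17:09:19",
                          "08:59:05", "11:27:28", "12:17:41", "10:45:42", "12:19:05"),
      stringsAsFactors = FALSE
    )

    # Hypothetical helper: convert "HH:MM:SS" strings to decimal hours
    to_hours <- function(x) {
      vapply(strsplit(x, ":", fixed = TRUE),
             function(p) sum(as.numeric(p) * c(1, 1 / 60, 1 / 3600)),
             numeric(1))
    }

    hours  <- to_hours(df$Origin_Duration)
    deg_fr <- nrow(df) - 1

    # scale() returns a one-column matrix; as.numeric() turns it into a plain vector
    df$scaled       <- as.numeric(scale(hours))
    df$duration_cdf <- pt(df$scaled, df = deg_fr)  # P(X <= x) per row

    print(df)
    ```

    Standardising is unaffected by the choice of time unit, so the values should agree with the pt() output listed above.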