I am trying to add a new column, say duration_probability,
which gives the probability of a value falling between 6 and 12 hours: P(6 < Origin_Duration ≤ 12)
dput(df)
structure(list(CRD_NUM = c(1000120005478330, 1000130009109199,
1000140001635234, 1000140002374747, 1000140003618308, 1000140007236959,
1000140015078086, 1000140026268650, 1000140027281272, 1000148000012215
), Origin_Duration = c("10:48:38", "07:41:34", "11:16:41", "09:19:35",
"17:09:19", "08:59:05", "11:27:28", "12:17:41", "10:45:42", "12:19:05"
)), .Names = c("CRD_NUM", "Origin_Duration"), class = c("data.table",
"data.frame"), row.names = c(NA, -10L))
CRD_NUM Origin_Duration
1: 1000120005478330 10:48:38
2: 1000130009109199 07:41:34
3: 1000140001635234 11:16:41
4: 1000140002374747 09:19:35
5: 1000140003618308 17:09:19
6: 1000140007236959 08:59:05
7: 1000140015078086 11:27:28
8: 1000140026268650 12:17:41
9: 1000140027281272 10:45:42
10: 1000148000012215 12:19:05
I am not sure how to do that in R. I am trying to use the cumulative distribution function of the standard normal distribution to get the probability that a commuter’s stay duration at a certain station falls between 6 and 12 hours. The output would be, for example, 0.96 for the duration 11:16:41.
My CDF would be something like: P(6 < X ≤ 12) = Φ((12 − μ)/σ) − Φ((6 − μ)/σ)
From your question it is unclear whether you already know the mean and variance or not. I will discuss both cases. Also, I will assume you have reason to believe that the durations are in fact normally distributed.
Known parameters: Suppose the mean and standard deviation are given in advance, say mu = 11 and sigma = 3. Then you can use that P(6 < X ≤ 12) = P(X ≤ 12) - P(X ≤ 6). The base R function pnorm() calculates this:
mu <- 11
sigma <- 3
pnorm(12, mu, sigma) - pnorm(6, mu, sigma)
# 0.5827683
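As a minimal sketch, the same calculation can be wrapped in a small helper and checked against the standardised form Φ((12 − μ)/σ) − Φ((6 − μ)/σ) from the question. The name interval_prob is made up for illustration; only pnorm() is base R.

```r
# Hypothetical helper (not from the original answer): P(lower < X <= upper)
# for X ~ Normal(mu, sigma), where sigma is the standard deviation.
interval_prob <- function(lower, upper, mu, sigma) {
  pnorm(upper, mean = mu, sd = sigma) - pnorm(lower, mean = mu, sd = sigma)
}

mu <- 11
sigma <- 3
p_direct <- interval_prob(6, 12, mu, sigma)

# Same probability via the standard normal CDF applied to z-scores
p_standardised <- pnorm((12 - mu) / sigma) - pnorm((6 - mu) / sigma)

all.equal(p_direct, p_standardised)  # TRUE
p_direct                             # about 0.583
```

Both routes give the same number because pnorm() standardises internally when mean and sd are supplied.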
Unknown parameters, P(6 < X < 12): If you do not yet know the mean and variance, you can estimate them from your data and use Student's t-distribution instead of the normal distribution (the story of why it is called the 'Student' distribution is nice too; you can find it on Wikipedia). To find the mean and variance, it makes sense to first convert df$Origin_Duration from character to a time type:
df$Origin_Duration <- as.POSIXct(df$Origin_Duration, format = "%H:%M:%S")
mu <- mean(df$Origin_Duration) # "2017-09-04 11:12:28 CEST"
df$demeaned <- df$Origin_Duration - mu
sigma <- var(df$demeaned)^0.5 # 153.68
Note that I subtracted the mean first, before calculating the variance. I did this so that the values are time differences, which R expresses here in minutes, rather than date-times. The standard deviation is therefore to be read as 153.68 minutes.
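Because as.POSIXct() attaches the current date when the format contains only a time, an alternative is to convert the durations to plain numbers of minutes. The following is only a sketch of that idea; to_minutes is a hypothetical helper, and the vector repeats the Origin_Duration values from the question.

```r
# Durations copied from the question's data
durations <- c("10:48:38", "07:41:34", "11:16:41", "09:19:35", "17:09:19",
               "08:59:05", "11:27:28", "12:17:41", "10:45:42", "12:19:05")

# Hypothetical helper: turn "HH:MM:SS" strings into minutes as plain numbers
to_minutes <- function(x) {
  vapply(strsplit(x, ":", fixed = TRUE), function(p) {
    p <- as.numeric(p)
    p[1] * 60 + p[2] + p[3] / 60
  }, numeric(1))
}

minutes <- to_minutes(durations)
mu_min <- mean(minutes)    # mean duration in minutes
sigma_min <- sd(minutes)   # sample standard deviation in minutes

mu_min / 60  # mean in hours, roughly 11.2 (i.e. about 11:12:28)
sigma_min    # about 153.68 minutes
```

Working in plain minutes keeps the units explicit and does not depend on the day the code is run.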
We will use the pt function to calculate the probability P(X ≤ 12) - P(X ≤ 6). In order to do so, we need a standardised (scaled, normalised) version of 12 and 6. That is, we have to subtract the mean and divide by the standard deviation:
# The date must match the one as.POSIXct() attached above (the day the code was run):
x6 <- as.numeric(difftime("2017-09-04 06:00:00", mu), units = "mins")/sigma
x12 <- as.numeric(difftime("2017-09-04 12:00:00", mu), units = "mins")/sigma
deg_fr <- length(df$demeaned)-1
p_x_smaller_than12 <- pt( x12, df = deg_fr ) # 0.6178973
p_x_smaller_than6 <- pt( x6, df = deg_fr ) # 0.03627651
p_x_smaller_than12 - p_x_smaller_than6
# [1] 0.5816208
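For comparison, here is a rough sketch of the same standardisation done in plain minutes, next to the normal plug-in estimate. The values of mu_min and sigma_min are assumptions: they are the rounded mean (11:12:28) and standard deviation (153.68 minutes) reported above, so the t-based result agrees with the output above only to about three decimals.

```r
# Assumed inputs: rounded estimates taken from the answer, in minutes
mu_min <- 11 * 60 + 12 + 28 / 60   # mean, 11:12:28
sigma_min <- 153.68                # standard deviation
deg_fr <- 10 - 1                   # ten observations

# Standardise the interval bounds (6 h and 12 h, expressed in minutes)
x6 <- (6 * 60 - mu_min) / sigma_min
x12 <- (12 * 60 - mu_min) / sigma_min

p_t <- pt(x12, df = deg_fr) - pt(x6, df = deg_fr)  # t-based, as in the answer
p_norm <- pnorm(x12) - pnorm(x6)                   # normal plug-in, for comparison

round(c(t = p_t, normal = p_norm), 3)
```

With only nine degrees of freedom the t-distribution has heavier tails, so the t-based probability for this interval comes out somewhat lower than the normal one.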
Added in response to comment: Unknown parameters, all entries:
# scale gives the distance from the mean in terms of standard deviations:
df$scaled <- scale(df$Origin_Duration)
pt(df$scaled, df = deg_fr)
# [1,] 0.4400575
# [2,] 0.1015886
# [3,] 0.5106114
# [4,] 0.2406431
# [5,] 0.9773264
# [6,] 0.2039751
# [7,] 0.5377728
# [8,] 0.6593331
# [9,] 0.4327620
# [10,] 0.6625280
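To attach these values as the new column the question asks for, something along the following lines should work. Note that each value is P(X ≤ x) for that row's duration under the fitted t model, not P(6 < X ≤ 12); df_out and the column name duration_probability are illustrative choices, and the durations are copied from the question.

```r
durations <- c("10:48:38", "07:41:34", "11:16:41", "09:19:35", "17:09:19",
               "08:59:05", "11:27:28", "12:17:41", "10:45:42", "12:19:05")

# Minutes as plain numbers; as.difftime() parses the "HH:MM:SS" strings
minutes <- as.numeric(as.difftime(durations, format = "%H:%M:%S", units = "mins"),
                      units = "mins")
deg_fr <- length(minutes) - 1

df_out <- data.frame(Origin_Duration = durations, stringsAsFactors = FALSE)
# scale() centres on the mean and divides by the sample standard deviation
df_out$duration_probability <- pt(as.numeric(scale(minutes)), df = deg_fr)

df_out
```

Wrapping scale() in as.numeric() drops the one-column matrix structure, so the result is stored as an ordinary numeric column.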