r probability-density standardized bell-curve

How to standardize a column of data in R and get a bell-curve histogram to find the percentage that falls within a range?


I have a data set, and one of its columns contains random numbers ranging from 300 to 400. I'm trying to find what proportion of this column is between 320 and 350 using R. To my understanding, I need to standardize the data and create a bell curve first. I have the mean and standard deviation, but when I compute (X - mean)/SD and plot a histogram of the resulting column, it's still not a bell curve.

This is the code I tried.

myData$C1 <- (myData$C1 - C1_mean) / C1_SD

Solution

  • If you simply want the proportion of observations in that range, there's no need for any standardization; you can directly use

    mean(myData$C1 >= 320 & myData$C1 <= 350)
    

    As for standardization, it does not create a "bell curve": it only shifts the distribution (centering, by subtracting the mean) and rescales it (dividing by the standard deviation). The shape of the density function stays the same.

    For instance,

    x <- c(rnorm(100, mean = 300, sd = 20), rnorm(100, mean = 400, sd = 20))
    mean(x >= 320 & x <= 350)
    # [1] 0.065
    hist(x)
    hist((x - mean(x)) / sd(x))
    

    I suspect that what you are looking for is an estimate of the true, unobserved proportion. Standardization would be needed if you had to look up tabulated values of the standard normal distribution function. In R, however, pnorm accepts the mean and standard deviation as arguments, so no standardization is required. In particular,

    pnorm(350, mean = mean(x), sd = sd(x)) - pnorm(320, mean = mean(x), sd = sd(x))
    # [1] 0.2091931
    

    That's the probability P(320 <= X <= 350), where X is normally distributed with mean mean(x) and standard deviation sd(x). This value is quite different from the empirical proportion above because we misspecified the underlying distribution by assuming it to be normal; it is actually a mixture of two normal distributions.
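
To make the link between standardization and pnorm concrete, here is a minimal sketch showing that standardizing the bounds by hand and then using the standard normal CDF gives the same probability as passing the mean and standard deviation to pnorm directly. The seed and the variable names (m, s, z_lo, z_hi, p_standardized, p_direct) are assumptions made for illustration and are not part of the original answer.

```r
set.seed(1)  # arbitrary seed, chosen only so the sketch is reproducible
x <- c(rnorm(100, mean = 300, sd = 20), rnorm(100, mean = 400, sd = 20))

m <- mean(x)
s <- sd(x)

# Route 1: standardize the bounds, then use the standard normal CDF
# (this mirrors what you would do with a printed z-table)
z_lo <- (320 - m) / s
z_hi <- (350 - m) / s
p_standardized <- pnorm(z_hi) - pnorm(z_lo)

# Route 2: let pnorm handle the mean and standard deviation
p_direct <- pnorm(350, mean = m, sd = s) - pnorm(320, mean = m, sd = s)

# The two routes agree up to floating-point tolerance
all.equal(p_standardized, p_direct)

# Standardizing the data does not change the empirical proportion either,
# as long as the bounds are standardized the same way
z <- (x - m) / s
mean(x >= 320 & x <= 350)
mean(z >= z_lo & z <= z_hi)
```

Both routes rely on the same normality assumption, so they share the same misspecification when the data are not actually normal; standardizing changes the scale of the histogram, not its shape.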