Search code examples
rbioinformaticsnormal-distributionkernel-density

Testing ratio of density distributions for normality


I have a normal distribution and a uniform distribution. I want to calculate a ratio: the density of the normal distribution, over the density of the uniform. Then I want to test this ratio for normality.

ht <- runif(3000, 1, 18585056)           # Uniform distribution
hm <- rnorm(35, 10000000, 5000000)       # Normal distribution
hmd <- density(hm, from=0, to=18585056)  # Kernel density of distributions over range 
htd <- density(ht, from=0, to=18585056)
ratio <- hmd$y/htd$y                     # Ratio of kernel density values

The distributions hm and ht above are examples of what my experimental data shows; the vectors I will actually be using are not randomly generated in R.

I know that I can get a good idea of normality from the correlation coefficient of a Q-Q plot:

qqp <- qqnorm(hm)
cor(qqp$x,qqp$y)

For hm, which is normally distributed, this gives a value close to 1.

Is there a way of determining the normality of the density vectors? e.g. hmd and ratio.

(Additional information: hm and ht are modelling homozygous and heterozygous SNPs across a genome of length 18585056)


Solution

  • First, this is really a statistics question; you should consider posting it on stats.stackexchange.com - you are likely to get a better answer.

    Second, the short answer to your question is that "testing the ratio of two density functions for normality" is not a meaningful idea. As mentioned in the comment, the ratio of two density functions is not a density function. Among other things, a density function must integrate to 1 over (-Inf,+Inf), which this ratio will not (generally).

    It is meaningful, however, to test if the distribution of the ratio of two random variables is normal. If you know that the numerator is normally distributed and the denominator is uniformly distributed, then the ratio will definitely not be normally distributed, as demonstrated below in the discussion of the slash distribution.

    If you do not know the distributions of the numerator and denominator, but just have random samples, you should calculate the ratio of the random variates and test that for normality. In your case (with minor edits):

    set.seed(123)
    ht <- runif(3000, 1, 18585056)           
    hm <- rnorm(3500, 10000000, 5000000)
    Z  <- sample(hm,1000)/sample(ht,1000)   # numer. and denom. must be same length
    par(mfrow=c(1,2))
    # histogram of Z
    hist(Z,xlim=c(-5,5), breaks=c(-Inf,seq(-5,5,0.2),Inf),freq=F, ylim=c(0,.4))
    # normal Q-Q plot    
    qqnorm(Z,ylim=c(-5,5))
    qqline(Z,xlim=c(-5,5),lty=2,col="blue")
    

    Clearly, the ratio distribution is not normal.

    Slash Distribution

    In the special case

    X ~ N[0,1] = φ(x)       (-Inf ≤ x ≤ Inf), and

    Y ~ U[0,1] = 1           (0 ≤ x ≤ 1); 0 elsewhere

    Z = X/Y ~ [ φ(0) - φ(x) ]/x2

    That is, a random variable formed as the ratio of two other (independent) random variables, the numerator distributed as N(0,1) and the denominator distributed as U(0,1), has the slash distribution, defined above. We can show this in R code as follows

    set.seed(123)
    X <- rnorm(10000)
    Y <- runif(10000)
    Z <- X/Y
    dslash <- function(x) (dnorm(0)-dnorm(x))/x^2
    
    x <- seq(-5,5,0.02)
    par(mfrow=c(1,2))
    hist(Z,xlim=c(-5,5), breaks=c(-Inf,seq(-5,5,0.2),Inf),freq=F, ylim=c(0,.4))
    lines(x,dslash(x),xlim=c(-5,5),col="red")
    lines(x,dnorm(x),xlim=c(-5,5),col="blue",lty=2)
    
    qqnorm(Z,ylim=c(-5,5))
    qqline(Z,xlim=c(-5,5),lty=2,col="blue")
    

    The bars represent the histogram of Z = X/Y, the red curve is the slash distribution, and the blue curve is the pdf of N[0,1] for reference. Because the red curve is "bell shaped" there is a temptation to think that Z is normally distributed, just with a larger variance. The Q-Q plot shows clearly that this is not the case. The tails of the slash distribution are much larger than would be expected from a normal distribution.