Search code examples
rtime-seriescutnormal-distribution

How to find the cutting points of normal distributions (ordered data case) - R


Suppose we have 2 rnorm, rnorm1 = rnorm(100,0,1) and rnorm2 = rnorm(100,1,1) and then we define input = c(rnorm1, rnorm2).

  1. How can we find the cutting point(in this case is point 100) of these 2 distributions and keep the data in input ordered (not changing the order at all)?
  2. Further more, if we have multiple normal distributions(say more than 3), how can we do the same thing but not defining the number of distributions?

This question really bother me, could anyone give a favor?


Solution

  • For the simplest case, where you have 2 distributions and know the means of each, you can find the cutpoint by calculating the (log) likelihood for each possible cutpoint:

    x = rnorm(100, 0, 1)
    y = rnorm(100, 1, 1)
    combined = c(x, y)
    
    log_lik = function(cutpoint) {
        part1 = combined[1:cutpoint]
        part2 = combined[(cutpoint + 1):length(combined)]
       sum(dnorm(part1, mean = 0, log = TRUE)) +
        sum(dnorm(part2, mean = 1, log = TRUE))
    }
    
    res = sapply(1:length(combined), log_lik)
    plot(res)
    which.max(res)
    

    This is just an ad-hoc solution to the problem though, for more robust statistical procedures you probably want to look at something like a changepoint analysis.