r, simulation, normal-distribution, outliers

Simulation of normal distribution data contaminated with outliers


I need to simulate 1000 sets of normally distributed data (each with 60 subgroups of size n=5) in R. Each set should be contaminated with 4 outliers (values beyond 1.5 IQR). Can anyone help?

Thanks in advance


Solution

  • A very simple approach to create a data.frame with a few outliers:

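    # Create a vector of normally distributed values with a few outliers
    # N       - number of random values
    # num.out - number of values to turn into outliers
    my.rnorm <- function(N, num.out, mean = 0, sd = 1) {
      x <- rnorm(N, mean = mean, sd = sd)
      ind <- sample.int(N, num.out)   # positions to contaminate
      dev <- x[ind] - mean            # deviation of each chosen value from the mean
      # push each chosen value a further 3*sd away from the mean,
      # so the shift also works when mean is not 0
      x[ind] <- mean + (abs(dev) + 3 * sd) * ifelse(dev < 0, -1, 1)
      x
    }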
    
    N <- 60
    num.out <- 4
    df <- data.frame( col1 = my.rnorm(N, num.out),
                      col2 = my.rnorm(N, num.out),
                      col3 = my.rnorm(N, num.out),
                      col4 = my.rnorm(N, num.out),
                      col5 = my.rnorm(N, num.out))
    

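    Please note that I used mean=0 and sd=1. The values mean=1, sd=0 that you provided in the comments do not make much sense: with sd=0 every value returned by rnorm() equals the mean, so there is no spread for an outlier to stand out from.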

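    The above approach does not guarantee that there will be exactly 4 outliers per column. Each of the 4 selected values ends up at least 3 standard deviations from the mean, so there will normally be at least 4, but there can be more, because rnorm() itself produces outliers now and then. About 0.7% of draws from a normal distribution fall beyond the theoretical 1.5 IQR fences, so an extra outlier in a column of 60 values is not unusual.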
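
    To see how many outliers a vector actually contains, the 1.5 IQR rule can be applied directly. This is only a sketch: count.outliers is a hypothetical helper name, and it assumes the quartiles are taken with quantile() at its default settings (boxplot() uses hinges, which can differ slightly).

    ```r
    # Hypothetical helper: count values outside the k * IQR fences.
    count.outliers <- function(x, k = 1.5) {
      q <- quantile(x, c(0.25, 0.75), names = FALSE)  # first and third quartile
      fence <- k * (q[2] - q[1])                      # k times the IQR
      sum(x < q[1] - fence | x > q[2] + fence)
    }

    set.seed(1)                      # arbitrary seed, only for reproducibility
    x <- c(rnorm(56), -5, -4, 4, 5)  # 56 clean draws plus 4 planted extremes
    n.found <- count.outliers(x)     # the 4 planted values, plus any natural ones
    ```

    If exactly 4 outliers are required, one option is to redraw a vector whenever the count differs from 4.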

    Another note is that data.frames might not be the best objects to store numeric values. If all your 1000 data.frames are numeric, it is better to store them in matrices.

    Depending on the final goal and the type of object you store your data in (list, data.frame or matrix), there are faster ways to create 1000 objects filled with random values.
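
    As a rough sketch of the matrix route, the 1000 sets could be kept as a list of 60 x 5 matrices. The helper sim.set and its arguments are hypothetical, and the code assumes the 4 outliers are spread over the whole set (all 300 values) rather than placed in each column; adjust this if the outliers are meant per column.

    ```r
    # Hypothetical helper: one set = n.sub x n matrix with num.out shifted values.
    sim.set <- function(n.sub = 60, n = 5, num.out = 4, mean = 0, sd = 1) {
      x <- rnorm(n.sub * n, mean = mean, sd = sd)
      ind <- sample.int(length(x), num.out)   # positions to contaminate
      dev <- x[ind] - mean
      # same shift as above: move a further 3*sd away from the mean
      x[ind] <- mean + (abs(dev) + 3 * sd) * ifelse(dev < 0, -1, 1)
      matrix(x, nrow = n.sub, ncol = n)       # rows = subgroups, columns = n
    }

    set.seed(42)                              # arbitrary seed
    sets <- replicate(1000, sim.set(), simplify = FALSE)
    length(sets)      # 1000
    dim(sets[[1]])    # 60 5
    ```

    Each subgroup is then a row, so per-subgroup statistics are available with, for example, rowMeans(sets[[1]]).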