Search code examples
rdistributionnormal-distribution

How to obtain a normal distributition with snew number of obervation from a wide observation subset


I have the following observed values for one variable

vars
  [1] 1053 1018 1048 1040 1046 1038 1014 1020 1038 1006 1016 1025 1011 1018 1014 1021 1022 1000 1003 1018 1037 1053 1033 1047 1018
 [26] 1022 1014 1048 1048 1026 1028 1056 1015 1030 1008 1035 1007 1026 1021 1008 1053 1052 1009 1015 1020 1013 1023 1015 1018 1038
 [51] 1007 1018 1035 1030 1016 1034 1028 1018 1018 1024 1024 1030 1028 1027 1016 1012 1023 1022 1045 1012 1019 1010 1086 1016 1034
 [76] 1043 1027 1013 1022 1030 1022 1059 1021 1008 1023 1010 1016 1020 1062 1013 1064 1052 1018 1033 1000 1025 1011 1016 1017 1048
[101] 1030 1022 1018 1020 1011 1095 1038 1018 1042 1007 1048 1038 1024 1016 1008 1051 1014 1020 1024 1008 1046 1045 1064 1075 1035
[126] 1010 1010 1010 1012 1012 1013 1016 1039 1030 1003 1013 1025 1005 1000 1029 1028 1008 1014 1026 1020 1062 1023 1008 1032 1024
[151] 1026 1011 1018 1028 1031 1042 1057 1054 1010 1012 1014 1011 1008 1036 1060 1018 1102 1015 1022 1010 1018 1018 1016 1028 1035
[176] 1047 1026 1050 1013 1018 1014 1018 1027 1033 1028 1024 1034 1009 1009 1054 1015 1014 1018 1016 1027 1012 1014 1027 1010 1046
[201] 1000 1018 1028 1068 1022 1012 1026 1024 1000 1022 1012 1030 1024 1002 1016 1034 1050 1062 1005 1010 1011 1010 1006 1023 1012
[226] 1018 1018 1032 1042 1015 1054 1048 1010 1016 1018 1042 1049 1004 1042 1047 1014 1028 1008 1043 1008 1010 1026 1014 1028 1037
[251] 1024 1010 1045 1015 1010 1035 1036 1020 1017 1022 1064 1021 1025 1008 1012 1022 1024 1008 1027 1070 1034 1032 1012 1006 1036
[276] 1028 1006 1014 1018 1022 1020 1024 1020 1027 1019 1017 1010 1050 1036 1008 1022 1058 1030 1020 1000 1022 1000 1008 1000 1059
[301] 1013 1004 1019 1014 1000 1018 1020

How can I get a normal distribution by selecting just 60 of these observations? Is that correct to use the following code?

seq = seq(1, 60, 1)
y = rnorm(seq, mean(vars), sd(vars))
y
hist(y)

Thanks


Solution

  • To expand on my first comment, I loaded that data and built a density plot:

    vars <- read.table(text="
      [1] 1053 1018 1048 1040 1046 1038 1014 1020 1038 1006 1016 1025 1011 1018 1014 1021 1022 1000 1003 1018 1037 1053 1033 1047 1018
     [26] 1022 1014 1048 1048 1026 1028 1056 1015 1030 1008 1035 1007 1026 1021 1008 1053 1052 1009 1015 1020 1013 1023 1015 1018 1038
     [51] 1007 1018 1035 1030 1016 1034 1028 1018 1018 1024 1024 1030 1028 1027 1016 1012 1023 1022 1045 1012 1019 1010 1086 1016 1034
     [76] 1043 1027 1013 1022 1030 1022 1059 1021 1008 1023 1010 1016 1020 1062 1013 1064 1052 1018 1033 1000 1025 1011 1016 1017 1048
    [101] 1030 1022 1018 1020 1011 1095 1038 1018 1042 1007 1048 1038 1024 1016 1008 1051 1014 1020 1024 1008 1046 1045 1064 1075 1035
    [126] 1010 1010 1010 1012 1012 1013 1016 1039 1030 1003 1013 1025 1005 1000 1029 1028 1008 1014 1026 1020 1062 1023 1008 1032 1024
    [151] 1026 1011 1018 1028 1031 1042 1057 1054 1010 1012 1014 1011 1008 1036 1060 1018 1102 1015 1022 1010 1018 1018 1016 1028 1035
    [176] 1047 1026 1050 1013 1018 1014 1018 1027 1033 1028 1024 1034 1009 1009 1054 1015 1014 1018 1016 1027 1012 1014 1027 1010 1046
    [201] 1000 1018 1028 1068 1022 1012 1026 1024 1000 1022 1012 1030 1024 1002 1016 1034 1050 1062 1005 1010 1011 1010 1006 1023 1012
    [226] 1018 1018 1032 1042 1015 1054 1048 1010 1016 1018 1042 1049 1004 1042 1047 1014 1028 1008 1043 1008 1010 1026 1014 1028 1037
    [251] 1024 1010 1045 1015 1010 1035 1036 1020 1017 1022 1064 1021 1025 1008 1012 1022 1024 1008 1027 1070 1034 1032 1012 1006 1036
    [276] 1028 1006 1014 1018 1022 1020 1024 1020 1027 1019 1017 1010 1050 1036 1008 1022 1058 1030 1020 1000 1022 1000 1008 1000 1059
    [301] 1013 1004 1019 1014 1000 1018 1020", fill=TRUE)
    vars <- unlist(vars[-1]) ; remove  extraneous item numbering
    

    Now plot the non-NA values

    png();  plot(density(na.omit(vars)))
     dev.off()
    

    So the underlying distribution is right-skewed and a random draw would not be expected to be "Normal"

    enter image description here

    If you believe that there is normality to be found in this data (which I don't) then it would appear to me to be a mixture of normals. The mixtools package lets one estimate the proportions of a mixture if you specify plausible starting values and the number of components, 2 or 3 in this case. It also delivers the means and standard deviations for the best fit.

    > set.seed(99)  
    > var.mixest <- mixtools::normalmixEM(na.omit(vars), lambda=c(0.8,0.2), mu=c(1018,1050), k=2)
    number of iterations= 80 
    > str(var.mixest)
    List of 9
     $ x         : int [1:307] 1053 1022 1007 1043 1030 1010 1026 1047 1000 1018 ...
     $ lambda    : num [1:2] 0.651 0.349
     $ mu        : num [1:2] 1017 1040
     $ sigma     : num [1:2] 8.04 17.98
     $ loglik    : num -1259
     $ posterior : num [1:307, 1:2] 0.000194 0.850119 0.920264 0.019248 0.554677 ...
      ..- attr(*, "dimnames")=List of 2
      .. ..$ : NULL
      .. ..$ : chr [1:2] "comp.1" "comp.2"
     $ all.loglik: num [1:81] -1422 -1306 -1303 -1302 -1301 ...
     $ restarts  : num 0
     $ ft        : chr "normalmixEM"
     - attr(*, "class")= chr "mixEM"
    

    Also tried with three components:

    > set.seed(99)  
    > var.mixest3 <- mixtools::normalmixEM(na.omit(vars), lambda=c(0.6,0.3,0.1), mu=c(1018,1050,1100), k=3)
    number of iterations= 299 
    > str(var.mixest3)
    List of 9
     $ x         : int [1:307] 1053 1022 1007 1043 1030 1010 1026 1047 1000 1018 ...
     $ lambda    : num [1:3] 0.6893 0.3016 0.0091
     $ mu        : num [1:3] 1017 1042 1095
     $ sigma     : num [1:3] 8.07 13.85 6.49
     $ loglik    : num -1254
     $ posterior : num [1:307, 1:3] 0.000203 0.8989 0.979451 0.018125 0.588612 ...
      ..- attr(*, "dimnames")=List of 2
      .. ..$ : NULL
      .. ..$ : chr [1:3] "comp.1" "comp.2" "comp.3"
     $ all.loglik: num [1:300] -1465 -1297 -1287 -1283 -1278 ...
     $ restarts  : num 0
     $ ft        : chr "normalmixEM"
     - attr(*, "class")= chr "mixEM"
    

    So you can pick which of those normal distributions that you want to use as the template for your samples. Here's a hacky way to find a sample that minimizes the distance between a template and the data:

    norm.est <- rnorm(60, mean=var.mixest3$mu[1], sd=var.mixest3$sigma[1])
    
    > str(norm.est)
     num [1:60] 1031 1012 1009 1017 1019 ...
    > sapply(norm.est, function(x){which.min(abs(x-na.omit(vars)) )})
      V67   V56  V148 V1011  V413  V413  V312  V148  V146  V210  V213  V211   V26  V213 V1011  V511  V211  V108   V63  V212  V212  V511 
       59    45   159   114    39    39    25   159   157    10    13    11     6    13   114    50    11   111    55    12    12    50 
      V23  V211  V108  V131  V210   V32   V56   V22  V131  V212  V131  V413  V511  V312   V23   V94   V55  V210  V104  V213  V211  V131 
        3    11   111   140    10    15    45     2   140    12   140    39    50    25     3    95    44    10   107    13    11   140 
      V63   V26   V63   V29  V211  V312   V94  V210  V104   V34   V96  V213  V148  V213   V43   V55 
       55     6    55     9    11    25    95    10   107    17    97    13   159    13    29    44 
    

    Those are just the indices into that data, so further processing is needed:

    new.norm <- sapply(norm.est, function(x){which.min(abs(x-na.omit(vars)) )})
    > plot( density(new.norm))    # Not a good result
    > dat.norm <- na.omit(vars)[new.norm] # use the indices to pick from data
    > plot( density(dat.norm))
    

    enter image description here

    That's a rather Normal appearing distribution.