I have the following observed values for one variable
vars
[1] 1053 1018 1048 1040 1046 1038 1014 1020 1038 1006 1016 1025 1011 1018 1014 1021 1022 1000 1003 1018 1037 1053 1033 1047 1018
[26] 1022 1014 1048 1048 1026 1028 1056 1015 1030 1008 1035 1007 1026 1021 1008 1053 1052 1009 1015 1020 1013 1023 1015 1018 1038
[51] 1007 1018 1035 1030 1016 1034 1028 1018 1018 1024 1024 1030 1028 1027 1016 1012 1023 1022 1045 1012 1019 1010 1086 1016 1034
[76] 1043 1027 1013 1022 1030 1022 1059 1021 1008 1023 1010 1016 1020 1062 1013 1064 1052 1018 1033 1000 1025 1011 1016 1017 1048
[101] 1030 1022 1018 1020 1011 1095 1038 1018 1042 1007 1048 1038 1024 1016 1008 1051 1014 1020 1024 1008 1046 1045 1064 1075 1035
[126] 1010 1010 1010 1012 1012 1013 1016 1039 1030 1003 1013 1025 1005 1000 1029 1028 1008 1014 1026 1020 1062 1023 1008 1032 1024
[151] 1026 1011 1018 1028 1031 1042 1057 1054 1010 1012 1014 1011 1008 1036 1060 1018 1102 1015 1022 1010 1018 1018 1016 1028 1035
[176] 1047 1026 1050 1013 1018 1014 1018 1027 1033 1028 1024 1034 1009 1009 1054 1015 1014 1018 1016 1027 1012 1014 1027 1010 1046
[201] 1000 1018 1028 1068 1022 1012 1026 1024 1000 1022 1012 1030 1024 1002 1016 1034 1050 1062 1005 1010 1011 1010 1006 1023 1012
[226] 1018 1018 1032 1042 1015 1054 1048 1010 1016 1018 1042 1049 1004 1042 1047 1014 1028 1008 1043 1008 1010 1026 1014 1028 1037
[251] 1024 1010 1045 1015 1010 1035 1036 1020 1017 1022 1064 1021 1025 1008 1012 1022 1024 1008 1027 1070 1034 1032 1012 1006 1036
[276] 1028 1006 1014 1018 1022 1020 1024 1020 1027 1019 1017 1010 1050 1036 1008 1022 1058 1030 1020 1000 1022 1000 1008 1000 1059
[301] 1013 1004 1019 1014 1000 1018 1020
How can I get a normal distribution by selecting just 60 of these observations? Is that correct to use the following code?
seq = seq(1, 60, 1)
y = rnorm(seq, mean(vars), sd(vars))
y
hist(y)
Thanks
To expand on my first comment, I loaded that data and built a density plot:
vars <- read.table(text="
[1] 1053 1018 1048 1040 1046 1038 1014 1020 1038 1006 1016 1025 1011 1018 1014 1021 1022 1000 1003 1018 1037 1053 1033 1047 1018
[26] 1022 1014 1048 1048 1026 1028 1056 1015 1030 1008 1035 1007 1026 1021 1008 1053 1052 1009 1015 1020 1013 1023 1015 1018 1038
[51] 1007 1018 1035 1030 1016 1034 1028 1018 1018 1024 1024 1030 1028 1027 1016 1012 1023 1022 1045 1012 1019 1010 1086 1016 1034
[76] 1043 1027 1013 1022 1030 1022 1059 1021 1008 1023 1010 1016 1020 1062 1013 1064 1052 1018 1033 1000 1025 1011 1016 1017 1048
[101] 1030 1022 1018 1020 1011 1095 1038 1018 1042 1007 1048 1038 1024 1016 1008 1051 1014 1020 1024 1008 1046 1045 1064 1075 1035
[126] 1010 1010 1010 1012 1012 1013 1016 1039 1030 1003 1013 1025 1005 1000 1029 1028 1008 1014 1026 1020 1062 1023 1008 1032 1024
[151] 1026 1011 1018 1028 1031 1042 1057 1054 1010 1012 1014 1011 1008 1036 1060 1018 1102 1015 1022 1010 1018 1018 1016 1028 1035
[176] 1047 1026 1050 1013 1018 1014 1018 1027 1033 1028 1024 1034 1009 1009 1054 1015 1014 1018 1016 1027 1012 1014 1027 1010 1046
[201] 1000 1018 1028 1068 1022 1012 1026 1024 1000 1022 1012 1030 1024 1002 1016 1034 1050 1062 1005 1010 1011 1010 1006 1023 1012
[226] 1018 1018 1032 1042 1015 1054 1048 1010 1016 1018 1042 1049 1004 1042 1047 1014 1028 1008 1043 1008 1010 1026 1014 1028 1037
[251] 1024 1010 1045 1015 1010 1035 1036 1020 1017 1022 1064 1021 1025 1008 1012 1022 1024 1008 1027 1070 1034 1032 1012 1006 1036
[276] 1028 1006 1014 1018 1022 1020 1024 1020 1027 1019 1017 1010 1050 1036 1008 1022 1058 1030 1020 1000 1022 1000 1008 1000 1059
[301] 1013 1004 1019 1014 1000 1018 1020", fill=TRUE)
vars <- unlist(vars[-1]) ; remove extraneous item numbering
Now plot the non-NA values
png(); plot(density(na.omit(vars)))
dev.off()
So the underlying distribution is right-skewed and a random draw would not be expected to be "Normal"
If you believe that there is normality to be found in this data (which I don't) then it would appear to me to be a mixture of normals. The mixtools package lets one estimate the proportions of a mixture if you specify plausible starting values and the number of components, 2 or 3 in this case. It also delivers the means and standard deviations for the best fit.
> set.seed(99)
> var.mixest <- mixtools::normalmixEM(na.omit(vars), lambda=c(0.8,0.2), mu=c(1018,1050), k=2)
number of iterations= 80
> str(var.mixest)
List of 9
$ x : int [1:307] 1053 1022 1007 1043 1030 1010 1026 1047 1000 1018 ...
$ lambda : num [1:2] 0.651 0.349
$ mu : num [1:2] 1017 1040
$ sigma : num [1:2] 8.04 17.98
$ loglik : num -1259
$ posterior : num [1:307, 1:2] 0.000194 0.850119 0.920264 0.019248 0.554677 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:2] "comp.1" "comp.2"
$ all.loglik: num [1:81] -1422 -1306 -1303 -1302 -1301 ...
$ restarts : num 0
$ ft : chr "normalmixEM"
- attr(*, "class")= chr "mixEM"
Also tried with three components:
> set.seed(99)
> var.mixest3 <- mixtools::normalmixEM(na.omit(vars), lambda=c(0.6,0.3,0.1), mu=c(1018,1050,1100), k=3)
number of iterations= 299
> str(var.mixest3)
List of 9
$ x : int [1:307] 1053 1022 1007 1043 1030 1010 1026 1047 1000 1018 ...
$ lambda : num [1:3] 0.6893 0.3016 0.0091
$ mu : num [1:3] 1017 1042 1095
$ sigma : num [1:3] 8.07 13.85 6.49
$ loglik : num -1254
$ posterior : num [1:307, 1:3] 0.000203 0.8989 0.979451 0.018125 0.588612 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:3] "comp.1" "comp.2" "comp.3"
$ all.loglik: num [1:300] -1465 -1297 -1287 -1283 -1278 ...
$ restarts : num 0
$ ft : chr "normalmixEM"
- attr(*, "class")= chr "mixEM"
So you can pick which of those normal distributions that you want to use as the template for your samples. Here's a hacky way to find a sample that minimizes the distance between a template and the data:
norm.est <- rnorm(60, mean=var.mixest3$mu[1], sd=var.mixest3$sigma[1])
> str(norm.est)
num [1:60] 1031 1012 1009 1017 1019 ...
> sapply(norm.est, function(x){which.min(abs(x-na.omit(vars)) )})
V67 V56 V148 V1011 V413 V413 V312 V148 V146 V210 V213 V211 V26 V213 V1011 V511 V211 V108 V63 V212 V212 V511
59 45 159 114 39 39 25 159 157 10 13 11 6 13 114 50 11 111 55 12 12 50
V23 V211 V108 V131 V210 V32 V56 V22 V131 V212 V131 V413 V511 V312 V23 V94 V55 V210 V104 V213 V211 V131
3 11 111 140 10 15 45 2 140 12 140 39 50 25 3 95 44 10 107 13 11 140
V63 V26 V63 V29 V211 V312 V94 V210 V104 V34 V96 V213 V148 V213 V43 V55
55 6 55 9 11 25 95 10 107 17 97 13 159 13 29 44
Those are just the indices into that data, so further processing is needed:
new.norm <- sapply(norm.est, function(x){which.min(abs(x-na.omit(vars)) )})
> plot( density(new.norm)) # Not a good result
> dat.norm <- na.omit(vars)[new.norm] # use the indices to pick from data
> plot( density(dat.norm))
That's a rather Normal appearing distribution.