Search code examples
rplothistogramcluster-analysisstatistics-bootstrap

Hist with lines in R


I generate 4 parts of big data: cluster1(10000 points), cluster2(15000 points), cluster3(15000 points) and throws(500 points). Here is the code:

library('MASS')
library('fpc')
#library("dbscan")
library("factoextra")
library("clustertend")
library("boot")
library("stream")
set.seed(123)
mu1<-c(-5,-7)
mu1
sigma1<-matrix(c(4,-2,-2,2), nrow=2, ncol=2, byrow = TRUE)
sigma1
n<-10000
cluster1<-mvrnorm(n,mu1,sigma1)
cluster1
#cluster1<-as.data.frame(cluster1)
#cluster1
#c<-runif(10000,1,1000)
#c

phi <- runif(15000, max = 2*pi)
rho <- sqrt(runif(15000))
x <- sqrt(5)*rho*cos(phi) + 6
y <- sqrt(10/3)*rho*sin(phi) + 4
range(2*(x - 6)^2 + 3*(y - 4)^2)
#[1] 0.001536582 9.999425234
plot(x, y)
cluster2<-cbind(x,y)
cluster2

u <- runif(15000, max = 3)
v <- runif(15000, max = 2)
x <- u + v - 10
y <- v - u + 8
range(x + y)
#[1] -1.999774  1.999826
range(x - y + 15)
#[1] -2.999646  2.999692
plot(x, y)
cluster3<-cbind(x,y)
cluster3
#cluster3<-as.data.frame(cluster1)
#cluster3

x <- runif(500, -20, 20)
y <- runif(500, -20, 20)
#u <- runif(500, max = 20)
#v <- runif(500, max = 20)
#x <- u + v - 20
#y <- v - u
range(x)
range(y)
plot(x,y)
throws<-cbind(x,y)
throws

data<-rbind(cluster1,cluster2,cluster3,throws)
data<-as.data.frame(data)
data
plot(data)

Then I try by using the bootstrap method, construct a distribution of H statistics for some fixed m, which is from 7% of the total number of generated points(m=2835). Here is th code where I do this:

B<-10#number of iterations
H<-NULL#value of Hopkins statistic
for(i in 1:B){
  N<-dim(data)[1]
  s<-sample(N,0.8*N)
  stat<-hopkins(data[s,], n=2835, byrow = TRUE)$H 
  H[i]<-stat
  #print(c(i, stat))
}

It takes very to generate. Then I should to compare this result with beta distribution - B(m,m). Here is the code:

hist(H)
#(density(H), col="red")
#hist(distB)
X<-seq(min(H), max(H), 0.001)
X
lines(X, dbeta(X,2835,2835), type="l", col="red")

The problem is that lined doesn't draw on hist. Can anybody say what is the problem? Here is the image, I see red line, but it's not exactly right. Hist with lines


Solution

  • Your y-axis values plotted by dbeta() are way too low to register on the supplied y-axis (<0.0000001). You need to overlay the second plot:

    # sample data
    H <- sample(seq(0.455,0.475,0.001), 1000, replace = TRUE)
    
    #plot histogram
    hist(H)
    
    # prepare graphics to add second plot
    par(new = TRUE)
    
    # sample data for second plot
    X <- seq(0.455,0.475, 0.001)
    Y <- dbeta(X,2835,2835)
    
    # plot second plot, remove axes
    plot(X, dbeta(X,2835,2835), type="l", col="red", axes = FALSE)
    axis(4, Y) # add axis on right side
    

    enter image description here