
# How to find meaningful boundaries between two continuous variables in R

To check for a relationship between two columns of the iris dataset, I ran `kruskal.test`, and the p-value suggests a significant relationship between these two columns.

```r
data(iris)
kruskal.test(iris$Petal.Length, iris$Sepal.Width)
```

Here are the results:

```
    Kruskal-Wallis rank sum test

data:  iris$Petal.Length and iris$Sepal.Width
Kruskal-Wallis chi-squared = 41.827, df = 22, p-value = 0.00656
```

The scatter plot also shows some sort of relationship: `plot(iris$Petal.Length, iris$Petal.Width)`

To find meaningful boundaries between these two variables, I ran `pairwise.wilcox.test`, but for this test to work, one of the variables needs to be categorical. If I pass two continuous variables to it, the results are not what I expect.

```r
pairwise.wilcox.test(x = iris$Petal.Length, g = iris$Petal.Width, p.adjust.method = "BH")
```

As output, I need a clear cut point showing where these two variables start to have some sort of relationship and where this relationship ends (as shown by the red line in the attached image above).

I am not sure if there is any statistical test or another programming technique to find these boundaries.

For example, I can mark the boundaries manually like this:

```r
library(data.table)  # provides setDT() and the := operator
setDT(iris)[, relationship := ifelse(Petal.Length > 3 & Sepal.Width < 3.5, 1, 0)]
```

But, is there a programming technique or library in R to find such boundaries?

It is important to note that my actual data is skewed.

Thanks, Saurabh

Solution

• There is no such thing as the best split. A split can only be the best under the conditions or criteria you specify.
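To make "best under a criterion you specify" concrete, here is a minimal sketch using a shallow classification tree. It is only an illustration under assumptions: it uses the `rpart` package (shipped with R as a recommended package), it reuses `Species` as the label, and the depth limit of 2 is an arbitrary choice. The criterion is rpart's default for classification, the reduction in Gini impurity; a different criterion or a different label could give different cut points.

```r
library(rpart)  # recommended package shipped with R

data(iris)

# Shallow classification tree: every split is a cut point on one variable,
# chosen to maximise the reduction in Gini impurity (rpart's default).
# maxcompete = 0 and maxsurrogate = 0 keep only the primary splits in fit$splits.
fit <- rpart(Species ~ Petal.Length + Petal.Width, data = iris, method = "class",
             control = rpart.control(maxdepth = 2, maxcompete = 0, maxsurrogate = 0))

# One row per split: the variable used and the cut point chosen for it
cuts <- data.frame(variable = rownames(fit$splits),
                   cut = unname(fit$splits[, "index"]))
print(cuts)

# Draw the cut points on the scatter plot. Lines are drawn across the whole
# plot for simplicity, although lower splits apply only within their parent region.
plot(iris$Petal.Length, iris$Petal.Width, col = iris$Species)
for (i in seq_len(nrow(cuts))) {
  if (cuts$variable[i] == "Petal.Length") {
    abline(v = cuts$cut[i], col = "red")
  } else {
    abline(h = cuts$cut[i], col = "red")
  }
}
```

The cut points are axis-aligned, like the manual `ifelse` rule in the question, which makes them easy to turn into a flag column.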

I think the second plot is what you expected, although I have also included the first one, which has a single line. The boundaries come from Linear Discriminant Analysis (LDA). Note that this is supervised learning, since it uses the Species column. If you have no such labels, you would need an unsupervised method instead. You might also be interested in the decision boundaries of a k-nearest neighbors classifier (which, like LDA, is supervised); for those, check https://stats.stackexchange.com/questions/21572/how-to-plot-decision-boundary-of-a-k-nearest-neighbor-classifier-from-elements-o.

```r
data(iris)
library(MASS)  # provides lda()

# first plot: raw scatter, coloured by species
plot(iris$Petal.Length, iris$Petal.Width, col = iris$Species)

# construct the model
mdl <- lda(Species ~ Petal.Length + Petal.Width, data = iris)

# build a regular grid over the observed range and classify every grid point
np <- 300
nd.x <- seq(from = min(iris$Petal.Length), to = max(iris$Petal.Length), length.out = np)
nd.y <- seq(from = min(iris$Petal.Width), to = max(iris$Petal.Width), length.out = np)
nd <- expand.grid(Petal.Length = nd.x, Petal.Width = nd.y)

p <- predict(mdl, newdata = nd)
prd <- as.numeric(p$class)  # class codes 1, 2, 3

# second plot: scatter with class means and discrimination lines
plot(iris[, c("Petal.Length", "Petal.Width")], col = iris$Species)
points(mdl$means, pch = "+", cex = 3, col = 1:3)  # one colour per class mean
# half-integer levels lie between the class codes, so each line is a class boundary
contour(x = nd.x, y = nd.y, z = matrix(prd, nrow = np, ncol = np),
        levels = c(1.5, 2.5), add = TRUE, drawlabels = FALSE)

# the same regions in discriminant (LD1, LD2) space; the grid maps to a
# parallelogram there, not a rectangle, so contour() on min-max sequences of
# the scores would be misaligned; plot the classified grid points instead
plot(p$x[, 1], p$x[, 2], pch = ".", xlab = "LD1", ylab = "LD2",
     col = adjustcolor(palette()[prd], alpha.f = 0.2))
points(predict(mdl)$x, col = iris$Species)
```
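
For the k-nearest neighbors boundaries mentioned above, the same grid-and-contour idea can be sketched as follows. This is an assumption-laden illustration rather than part of the LDA solution: it uses `knn()` from the `class` package (shipped with R as a recommended package), and `k = 15` and the grid size are arbitrary choices.

```r
library(class)  # recommended package shipped with R; provides knn()

data(iris)
train <- iris[, c("Petal.Length", "Petal.Width")]

# Regular grid over the observed range of both predictors
np <- 200
gx <- seq(min(train$Petal.Length), max(train$Petal.Length), length.out = np)
gy <- seq(min(train$Petal.Width), max(train$Petal.Width), length.out = np)
grid <- expand.grid(Petal.Length = gx, Petal.Width = gy)

# Classify every grid point by majority vote among its k nearest neighbours
# (k = 15 is an arbitrary value chosen for illustration)
set.seed(1)  # knn() breaks ties at random
cls <- knn(train = train, test = grid, cl = iris$Species, k = 15)
z <- matrix(as.integer(cls), nrow = np, ncol = np)

plot(train, col = iris$Species)
# Half-integer levels lie between the class codes 1, 2, 3
contour(x = gx, y = gy, z = z, levels = c(1.5, 2.5),
        add = TRUE, drawlabels = FALSE)
```

Unlike the straight LDA lines, these boundaries follow the local density of the points, so they change shape as `k` changes.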