r knn lda kruskal-wallis pairwise.wilcox.test

How to find meaningful boundaries between two continuous variables in R

To look for a relationship between two columns of the iris dataset, I ran kruskal.test, and the p-value suggests a significant relationship between them.

data(iris)
kruskal.test(iris$Petal.Length, iris$Sepal.Width)

Here are the results:

    Kruskal-Wallis rank sum test

data:  iris$Petal.Length and iris$Sepal.Width
Kruskal-Wallis chi-squared = 41.827, df = 22, p-value = 0.00656
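Note that kruskal.test(x, g) treats its second argument as a grouping factor, so in the call above every distinct Sepal.Width value becomes its own group (hence df = 22). For two continuous variables, a rank correlation test is the more usual check. A minimal sketch, assuming Spearman's rho is an acceptable measure for skewed data (this is a swapped-in technique, not part of the original question):

```r
data(iris)

# Spearman rank correlation between two continuous columns.
# exact = FALSE avoids warnings about ties when computing the p-value.
res <- cor.test(iris$Petal.Length, iris$Sepal.Width,
                method = "spearman", exact = FALSE)

res$estimate  # rho, between -1 and 1
res$p.value   # p-value for H0: rho = 0
```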

The scatter plot also shows some sort of relationship:

plot(iris$Petal.Length, iris$Petal.Width)

To find meaningful boundaries between these two variables, I ran pairwise.wilcox.test, but that test requires one of the variables to be categorical. If I pass two continuous variables to it, the results are not what I expect.

pairwise.wilcox.test(x = iris$Petal.Length, g = iris$Petal.Width, p.adjust.method = "BH")
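One way to give pairwise.wilcox.test the categorical variable it needs is to bin one of the columns first. A rough sketch, assuming quartile bins of Petal.Width (the choice of bins is arbitrary and only for illustration; `width.bin` is a hypothetical name):

```r
data(iris)

# Assumption: quartile bins are an illustrative grouping, not a recommended one
breaks <- quantile(iris$Petal.Width, probs = seq(0, 1, by = 0.25))
width.bin <- cut(iris$Petal.Width, breaks = breaks, include.lowest = TRUE)

# Compare Petal.Length between every pair of bins, BH-adjusted;
# exact = FALSE is passed on to wilcox.test because the data contain ties
res <- pairwise.wilcox.test(x = iris$Petal.Length, g = width.bin,
                            p.adjust.method = "BH", exact = FALSE)
res$p.value
```

The resulting p-value matrix only says which bins differ; the bin edges themselves are still chosen by hand.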

As output, I need a clear cut point showing where the relationship between these two variables begins and where it ends (as shown by the red line in the attached image above).

I am not sure if there is any statistical test or another programming technique to find these boundaries.

For example, I can mark the boundaries manually like this:

library(data.table)
setDT(iris)[, relationship := ifelse(Petal.Length > 3 & Sepal.Width < 3.5, 1, 0)]

But, is there a programming technique or library in R to find such boundaries?

It is important to note that my actual data is skewed.

Thanks, Saurabh

Solution

There is no such thing as the best split. A split can only be the best under conditions or criteria that you specify.
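To make "best under a criterion" concrete, here is a minimal sketch that searches for a single cut point on Petal.Length. The criterion is an assumption chosen for illustration: minimise the within-group sum of squared deviations of Petal.Width on either side of the cut (the same idea a one-split regression tree uses). A different criterion would generally give a different cut.

```r
data(iris)
x <- iris$Petal.Length
y <- iris$Petal.Width

# Candidate cuts: midpoints between consecutive distinct x values
ux <- sort(unique(x))
cuts <- (head(ux, -1) + tail(ux, -1)) / 2

# Within-group sum of squares of y for each candidate cut
sse <- sapply(cuts, function(cp) {
  left  <- y[x <  cp]
  right <- y[x >= cp]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
})

best.cut <- cuts[which.min(sse)]
best.cut
```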

I think you were expecting the second plot, but I have added the first one too, where you have a single line. Both use Linear Discriminant Analysis. However, this is supervised learning, since we have the Species column. If you have no labels, you might be interested in unsupervised methods such as clustering. Note that k-nearest neighbors is itself a supervised classifier; for drawing its decision boundaries, see https://stats.stackexchange.com/questions/21572/how-to-plot-decision-boundary-of-a-k-nearest-neighbor-classifier-from-elements-o.

data(iris)
library(MASS)

plot(iris$Petal.Length, iris$Petal.Width, col = iris$Species)

# construct the model
mdl <- lda(Species ~ Petal.Length + Petal.Width, data = iris)

# build a prediction grid covering the plotted range
np <- 300
nd.x <- seq(from = min(iris$Petal.Length), to = max(iris$Petal.Length), length.out = np)
nd.y <- seq(from = min(iris$Petal.Width), to = max(iris$Petal.Width), length.out = np)
nd <- expand.grid(Petal.Length = nd.x, Petal.Width = nd.y)

# predicted class (1, 2 or 3) at every grid point
prd <- as.numeric(predict(mdl, newdata = nd)$class)

plot(iris[, c("Petal.Length", "Petal.Width")], col = iris$Species)
# class means, coloured like the three species
points(mdl$means, pch = "+", cex = 3, col = 1:3)
# class codes are 1, 2, 3, so the borders lie at 1.5 and 2.5
contour(x = nd.x, y = nd.y, z = matrix(prd, nrow = np, ncol = np), 
        levels = c(1.5, 2.5), add = TRUE, drawlabels = FALSE)

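# Second plot: the same borders drawn in discriminant (LD1/LD2) space.
# The petal-space grid cannot be reused on LD axes, so refit LDA on the LD scores
# (assumption: with two predictors this invertible linear change of coordinates
# leaves the classification unchanged) and predict on a grid of scores.
scores <- data.frame(predict(mdl)$x, Species = iris$Species)
mdl.ld <- lda(Species ~ LD1 + LD2, data = scores)

p.x <- seq(from = min(scores$LD1), to = max(scores$LD1), length.out = np) # LD1 scores
p.y <- seq(from = min(scores$LD2), to = max(scores$LD2), length.out = np) # LD2 scores
prd.ld <- as.numeric(predict(mdl.ld, newdata = expand.grid(LD1 = p.x, LD2 = p.y))$class)

plot(scores[, c("LD1", "LD2")], col = scores$Species)
contour(x = p.x, y = p.y, z = matrix(prd.ld, nrow = np, ncol = np), 
        levels = c(1.5, 2.5), add = TRUE, drawlabels = FALSE)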
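The k-nearest-neighbors boundaries mentioned in the answer can be drawn with the same grid-and-contour idea. A minimal sketch, assuming the `class` package (shipped with standard R installations) and an arbitrary k = 15; the names `train`, `grid`, `cls`, `gx` and `gy` are hypothetical:

```r
library(class)
data(iris)

train <- iris[, c("Petal.Length", "Petal.Width")]

# Grid over the plotted range (np.knn is kept small so knn() runs quickly)
np.knn <- 100
gx <- seq(min(train$Petal.Length), max(train$Petal.Length), length.out = np.knn)
gy <- seq(min(train$Petal.Width), max(train$Petal.Width), length.out = np.knn)
grid <- expand.grid(Petal.Length = gx, Petal.Width = gy)

# Classify every grid point from its 15 nearest training points (k is an assumption)
cls <- knn(train = train, test = grid, cl = iris$Species, k = 15)

plot(train, col = iris$Species)
contour(x = gx, y = gy, z = matrix(as.numeric(cls), nrow = np.knn, ncol = np.knn),
        levels = c(1.5, 2.5), add = TRUE, drawlabels = FALSE)
```

Unlike LDA, the resulting borders are not straight lines, and they change with k. Like LDA, this still needs the Species labels.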

Linked to: How to plot classification borders on an Linear Discrimination Analysis plot in R