I have a data set from a drill hole that records several geomechanical properties every 2 meters. I am trying to create geomechanical domains and assign each point to one of them.
I am trying to use random forest classification, but I am unsure how to relate the proximity matrix (or any other result from the randomForest function) to labels.
My humble code so far is as follows:
dh <- read.csv("gt_1_classification.csv", header = T)
#replace all N/A with 0
dh[is.na(dh)] <- 0
library(randomForest)
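# Omitting the response (y) makes randomForest run in unsupervised mode;
# proximity = TRUE is needed to get the proximity matrix back
dh_rf <- randomForest(x = dh, importance = TRUE, proximity = TRUE, ntree = 500)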
I would like the classifier to decide the domains on its own.
Any help would be great!
Hack-R is correct -- first it is necessary to explore the data using some clustering (unsupervised learning) methods. I've provided some sample code using the R built-in mtcars data as a demonstration:
# Info on the data
?mtcars
head(mtcars)
pairs(mtcars) # Matrix plot
# Calculate the distance between each pair of rows (each car with its variables)
# By default, Euclidean distance = sqrt(sum((x_i - y_i)^2))
?dist
d <- dist(mtcars)
d # Potentially huge matrix
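To make the distance formula in the comment above concrete, here is a small sketch that computes the Euclidean distance between the first two cars by hand and compares it with the matching entry from dist(). The names x, y, manual, d_check and from_dist are made up for this illustration.

```r
# Distance between the first two rows of mtcars, computed by hand
x <- as.numeric(mtcars[1, ])
y <- as.numeric(mtcars[2, ])
manual <- sqrt(sum((x - y)^2))

# The same pair taken from the distance matrix returned by dist()
d_check <- as.matrix(dist(mtcars))
from_dist <- d_check[1, 2]

# Should report TRUE if both routes agree
all.equal(manual, from_dist)
```

Because nothing is rescaled here, variables with large numeric ranges (such as disp and hp) contribute most to the distance.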
# Use the distance matrix for clustering
# First we'll try hierarchical clustering
?hclust
hcmt <- hclust(d) # named hcmt so that base::c() is not masked
hcmt
# Plot dendrogram of clusters
plot(hcmt)
# We might want to try 3 clusters
# Need to specify either k = number of groups or h = height at which to cut
groups3 <- cutree(hcmt, k = 3) # "groups3" = 3 groups
# cutree(hcmt, h = 230) will give same result
groups3
# Or we could do several groups at once
groupsmultiple <- cutree(hcmt, k = 2:5)
head(groupsmultiple)
# Draw boxes around clusters
rect.hclust(hcmt, k = 2, border = "gray")
rect.hclust(hcmt, k = 3, border = "blue")
rect.hclust(hcmt, k = 4, border = "green4")
rect.hclust(hcmt, k = 5, border = "darkred")
# Alternatively we can try k-means clustering
?kmeans
km <- kmeans(mtcars, 5)
km
# Graph based on k-means
# install.packages("cluster") # run once if the package is not installed
library(cluster)
clusplot(mtcars,       # data frame
         km$cluster,   # cluster data
         color = TRUE, # color
         lines = 3,    # Lines connecting centroids
         labels = 2)   # Labels clusters and cases
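k-means starts from randomly chosen centers, so the cluster numbering (and sometimes the clusters themselves) can change from run to run. A minimal sketch of one way to make the run repeatable and to compare it with a hierarchical solution follows. The seed value, nstart = 25, the choice of 3 clusters, and the names km_demo and hc_demo are assumptions made only for this illustration.

```r
# Fix the random seed so the k-means result is repeatable (42 is arbitrary)
set.seed(42)

# nstart = 25 tries 25 random starts and keeps the best solution
km_demo <- kmeans(mtcars, centers = 3, nstart = 25)

# Hierarchical clustering cut into the same number of groups
hc_demo <- cutree(hclust(dist(mtcars)), k = 3)

# Cross-tabulate the two sets of labels to see how far they agree
table(kmeans = km_demo$cluster, hclust = hc_demo)
```

Cluster numbers are arbitrary labels, so agreement shows up as each row of the table having most of its count in a single column, not necessarily on the diagonal.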
After running this on your own data, consider which definition of clusters captures the level of similarity that interests you. You can then create a new variable with a "level" for each cluster and fit a supervised model to it.
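As a rough sketch of what creating that new variable could look like, the snippet below attaches the cluster assignment to a copy of mtcars as a factor. The column name domain, the level names, the copy cars_labeled and the choice of 3 clusters are all placeholders for this illustration. On drill-hole data the same pattern would apply to your own data frame.

```r
# Work on a copy so the original data stay untouched
cars_labeled <- mtcars

# Cluster, cut into 3 groups, and store the result as a factor
cluster_id <- cutree(hclust(dist(mtcars)), k = 3)
cars_labeled$domain <- factor(cluster_id,
                              labels = c("domain_1", "domain_2", "domain_3"))

# How many rows fall into each level
table(cars_labeled$domain)
head(cars_labeled[, c("mpg", "hp", "domain")])
```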
Here's a decision tree example using the same mtcars data. NOTE that here I used mpg as the response -- you would want to use your new variable based on the clusters.
# install.packages("rpart") # run once if the package is not installed
library(rpart)
?rpart
# grow tree (method = "anova" fits a regression tree; it is also what rpart picks for a numeric response)
tree_mtcars <- rpart(mpg ~ ., method = "anova", data = mtcars)
tree_mtcars
summary(tree_mtcars) # detailed summary of splits
# Get R-squared
rsq.rpart(tree_mtcars)
?rsq.rpart
# plot tree
plot(tree_mtcars, uniform = TRUE, main = "Regression Tree for mpg ")
text(tree_mtcars, use.n = TRUE, all = TRUE, cex = .8)
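The tree above predicts mpg. With cluster labels as the response, the same call becomes a classification tree. The following is only a sketch under assumptions: the labels come from a 3-group hierarchical clustering, and the names cars_labeled, domain, tree_domain and pred are invented for the example.

```r
library(rpart)

# Attach cluster labels (3 groups, chosen arbitrarily) as a factor response
cars_labeled <- mtcars
cars_labeled$domain <- factor(cutree(hclust(dist(mtcars)), k = 3))

# method = "class" fits a classification tree for a factor response
tree_domain <- rpart(domain ~ ., data = cars_labeled, method = "class")
tree_domain

# Compare the fitted classes with the cluster labels on the training data
pred <- predict(tree_domain, type = "class")
table(predicted = pred, actual = cars_labeled$domain)
```

The splits show which variables separate the clusters, which helps describe each domain in terms of the measured properties.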
Note that, although very informative, a basic decision tree is often not great for prediction. If prediction is the goal, other models should also be explored.
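Coming back to the proximity matrix mentioned in the question: one possible way to turn it into labels is to treat 1 - proximity as a dissimilarity and cluster on that. This is a sketch of that idea, not something the steps above prescribe. It assumes the randomForest package is installed, uses mtcars in place of the drill-hole data, and picks k = 3 arbitrarily. The names rf_unsup, prox_dist and rf_groups are made up here.

```r
# Only run if the randomForest package is available
if (requireNamespace("randomForest", quietly = TRUE)) {
  library(randomForest)
  set.seed(1) # arbitrary seed so the forest is repeatable

  # No response supplied, so randomForest runs in unsupervised mode;
  # proximity = TRUE asks for the row-by-row proximity matrix
  rf_unsup <- randomForest(x = mtcars, ntree = 500, proximity = TRUE)

  # Turn proximities (similarities) into dissimilarities and cluster them
  prox_dist <- as.dist(1 - rf_unsup$proximity)
  rf_groups <- cutree(hclust(prox_dist), k = 3)

  # One label per row, ready to be stored as a new variable
  print(table(rf_groups))
}
```

The resulting rf_groups vector can be used in the same way as the labels from hclust or kmeans on the raw variables.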