Tags: r, machine-learning, random-forest, unsupervised-learning, multilabel-classification

Unsupervised Classification: Assign classes to data


I have a set of data from a drill hole; it contains information about different geomechanical properties every 2 meters. I am trying to create geomechanical domains and assign each point to a domain.

I am trying to use random forest classification, and am unsure how to relate the proximity matrix (or any other output of the randomForest function) to class labels.

My humble code so far is as follows:

dh <- read.csv("gt_1_classification.csv", header = T)

# Replace all NA values with 0
dh[is.na(dh)] <- 0
library(randomForest)
# Omitting a response (y) runs randomForest in unsupervised mode;
# proximity = TRUE is needed to actually get the proximity matrix
dh_rf <- randomForest(x = dh, importance = TRUE, proximity = TRUE, ntree = 500)

I would like the classifier to decide the domains on its own.

Any help would be great!


Solution

  • Hack-R is correct -- the first step is to explore the data with some clustering (unsupervised learning) methods. I've provided some sample code using R's built-in mtcars data as a demonstration:

    # Info on the data
    ?mtcars
    head(mtcars)
    pairs(mtcars)    # Matrix plot
    
    # Calculate the distance between each pair of rows (each car with its variables)
    # by default, Euclidean distance = sqrt(sum((x_i - y_i)^2))
    ?dist
    d <- dist(mtcars)
    d # Potentially huge matrix
    
    # Use the distance matrix for clustering
    # First we'll try hierarchical clustering
    ?hclust
    c <- hclust(d)
    c
    
    # Plot dendrogram of clusters
    plot(c)
    
    # We might want to try 3 clusters
    # need to specify either k (number of groups) or h (cut height)
    groups3 <- cutree(c, k = 3) # "groups3" = 3 groups
    # cutree(c, h = 230) cuts by height instead of by number of groups
    groups3
    # Or we could do several groups at once
    groupsmultiple <- cutree(c, k = 2:5)
    head(groupsmultiple)
    
    # Draw boxes around clusters
    rect.hclust(c, k = 2, border = "gray")
    rect.hclust(c, k = 3, border = "blue")
    rect.hclust(c, k = 4, border = "green4")
    rect.hclust(c, k = 5, border = "darkred")
    
    # Alternatively we can try k-means clustering
    ?kmeans
    km <- kmeans(mtcars, 5)
    km
    
    # Graph based on k-means
    install.packages("cluster") # only needed once
    library(cluster)
    clusplot(mtcars,        # data frame
             km$cluster,    # cluster assignments
             color = TRUE,  # color the clusters
             lines = 1,     # draw lines connecting the cluster centers (valid values: 0, 1, 2)
             labels = 2)    # label both clusters and cases
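
    The example above just picks k = 3 or centers = 5 by eye. One rough way to sanity-check the number of clusters (a common heuristic, not part of the original answer) is an elbow plot of the total within-cluster sum of squares; look for the k where the curve stops dropping sharply:

    # Elbow heuristic: total within-cluster sum of squares for k = 1..8
    # Scaling first (e.g. kmeans(scale(mtcars), ...)) is often advisable,
    # since the mtcars variables are on very different scales
    set.seed(123)
    wss <- sapply(1:8, function(k) kmeans(mtcars, centers = k, nstart = 25)$tot.withinss)
    plot(1:8, wss, type = "b",
         xlab = "Number of clusters k",
         ylab = "Total within-cluster sum of squares")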
    

    After running this on your own data, consider which clustering captures the level of similarity that matters for your domains. You can then create a new variable with a "level" (label) for each cluster and fit a supervised model to it, as sketched below.
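
    As a minimal sketch of that last step (assuming you keep the hierarchical clustering object c from above and settle on 3 clusters), you could attach the cluster labels to the data and fit a random forest classifier to them. I use randomForest here only because that is what the question used; the decision tree example below works the same way with the cluster variable as the response. The object names (mt, rf_fit) are just illustrative:

    library(randomForest)
    mt <- mtcars                            # work on a copy of the data
    mt$cluster <- factor(cutree(c, k = 3))  # the new "level" variable

    set.seed(1)
    rf_fit <- randomForest(cluster ~ ., data = mt, ntree = 500, importance = TRUE)
    rf_fit               # confusion matrix and OOB error rate
    varImpPlot(rf_fit)   # which variables drive the cluster/domain assignments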

    Here's a decision tree example using the same mtcars data. NOTE that here I used mpg as the response -- you would want to use your new variable based on the clusters.

    install.packages("rpart")
    library(rpart)
    ?rpart
    # Grow the tree (method = "anova" fits a regression tree and is the
    # default for a numeric response, so it could also be omitted)
    tree_mtcars <- rpart(mpg ~ ., method = "anova", data = mtcars)
    
    tree_mtcars
    
    summary(tree_mtcars) # detailed summary of splits
    
    # Get R-squared
    rsq.rpart(tree_mtcars)
    ?rsq.rpart
    
    # plot tree 
    plot(tree_mtcars, uniform = TRUE, main = "Regression Tree for mpg ")
    text(tree_mtcars, use.n = TRUE, all = TRUE, cex = .8)
    

    Note that although very informative, a basic decision tree is often not great for prediction. If prediction is the goal, other models should also be explored; a rough comparison with a random forest is sketched below.
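
    As an illustrative sketch of such a comparison (my addition, with arbitrary object names): mtcars has only 32 rows, so rather than a proper train/test split this simply contrasts the single tree's in-sample error with a random forest's out-of-bag error on the same response:

    library(randomForest)
    set.seed(42)
    rf_mpg <- randomForest(mpg ~ ., data = mtcars, ntree = 500)

    # RMSE of the single tree (in-sample, so optimistic)
    sqrt(mean((mtcars$mpg - predict(tree_mtcars, mtcars))^2))
    # RMSE of the forest, using its out-of-bag predictions
    sqrt(mean((mtcars$mpg - predict(rf_mpg))^2))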