Tags: r, import, dataset, cluster-analysis, k-means

RStudio: get_dist() error message "'x' must be numeric" following clustering guide?


I'm pretty new to R, so I was following a guide for cluster analysis. When I get to get_dist(), I keep getting the error: Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric. When I remove the column with the <chr> data, it works fine, but I want to keep these labels, like the "state" labels in the USArrests dataset.

I found a question that was pretty similar to mine, but none of its comments or answers helped me. I've seen a few posts that suggest trying get_dist(x$x) or as.numeric(as.character(x$x)), but I must admit this workaround doesn't make much sense to me, and I haven't had any success implementing these suggestions.

I can't show my full data set, but I can provide the results of head(), and I have noticed that it differs from head(USArrests):

library(readxl)
Mother_2_ABS_Summer_2019_clean <- read_excel("~/.../Mother_2_ABS_Summer_2019_clean.xls", 
    range = "D1:H61")
head(Mother_2_ABS_Summer_2019_clean)

...1            Audience  Genre  Structure  Proofreading
<chr>           <dbl>     <dbl>  <dbl>      <dbl>
ABS-P_29_S31    2         2      2.0        3
ABS_40_S50      3         3      3.5        3
ABS_57_S47      2         2      2.0        3
ABS_86_S48      4         3      3.0        4
ABS_143_S42     2         2      2.0        3
ABS-P_152_S49   2         1      1.0        4

head(USArrests)

            Murder  Assault  UrbanPop  Rape
            <dbl>   <int>    <int>     <dbl>
Alabama     13.2    236      58        21.2
Alaska      10.0    263      48        44.5
Arizona      8.1    294      80        31.0
Arkansas     8.8    190      50        19.5
California   9.0    276      91        40.6
Colorado     7.9    204      78        38.7

So what I've noticed is that in USArrests the state labels aren't stored as a <chr> column, unlike the identifiers for my documents.
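The difference noted above can be checked directly in base R. This is a minimal sketch that assumes only the built-in USArrests data; the variable names are illustrative.

```r
# In USArrests the state names are row names, not a data column
state_labels <- rownames(USArrests)
head(state_labels)

# Every actual column is numeric, so functions that need numeric input accept it
numeric_cols <- sapply(USArrests, is.numeric)
numeric_cols

# A tibble returned by read_excel() does not carry row names,
# so the document identifiers arrive as an ordinary <chr> column instead
```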

When I follow the guide, I have no problems up until get_dist():

dat1 <- na.omit(Mother_2_ABS_Summer_2019_clean)
dat1 <- scale(dat1)

distance <- get_dist(dat1)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric

When I import only the 4 columns that contain numeric data and go through the guide, everything works just fine and I can view the cluster results. The problem is that I want to see the visualizations WITH the document identifications; otherwise the results don't mean too much when looking at them.

If any of you have any advice or suggestions, it would be greatly appreciated.


Solution

  • UNTESTED: You could assign those labels as the row names:

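    library(tidyverse)

    # Move the "...1" label column into the row names and keep the result
    Mother_2_ABS_Summer_2019_clean <- Mother_2_ABS_Summer_2019_clean %>%
      remove_rownames() %>%
      column_to_rownames(var = "...1")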

    Consider renaming the first column beforehand so the call above is cleaner and more likely to work. The data then has the same format as USArrests, with the labels stored as row names rather than as a column.
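    The same idea can be sketched in base R alone. This is an illustration and not the guide's code: the toy data frame copies the six rows shown in the question, the column name id is a hypothetical replacement for "...1", and base dist() stands in for get_dist(). Note that the error message names colMeans(), which scale() calls internally, so the failure most likely occurs at the scale() step rather than inside get_dist().

```r
# Toy data copied from the head() output in the question; "id" is a hypothetical column name
dat <- data.frame(
  id = c("ABS-P_29_S31", "ABS_40_S50", "ABS_57_S47",
         "ABS_86_S48", "ABS_143_S42", "ABS-P_152_S49"),
  Audience     = c(2, 3, 2, 4, 2, 2),
  Genre        = c(2, 3, 2, 3, 2, 1),
  Structure    = c(2.0, 3.5, 2.0, 3.0, 2.0, 1.0),
  Proofreading = c(3, 3, 3, 4, 3, 4),
  stringsAsFactors = FALSE
)

# With the character column still present, scale() fails;
# capture the message instead of stopping the script
err <- tryCatch(scale(dat), error = function(e) conditionMessage(e))

# Move the labels into the row names, then drop the column
rownames(dat) <- dat$id
dat$id <- NULL

# All remaining columns are numeric, so scaling and distances work
dat_scaled <- scale(na.omit(dat))
distance <- dist(dat_scaled)  # Euclidean distance; labels are carried along

attr(distance, "Labels")
```

    Because the labels travel with the distance object, passing a data frame prepared this way to get_dist() and fviz_dist() should show the document identifiers on the plot axes, as the state names do for USArrests. Row names must be unique for this to work.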