
colMeans is not functioning in R


I need to do this for my assignment:

We focus on the following subset of the variables: regime, oil, logGDPcp, and illit. Remove observations that have missing values in any of these variables. Using the scale() function, scale these variables so that each variable has a mean of zero and a standard deviation of one. Fit the k-means clustering algorithm with two clusters. How many observations are assigned to each cluster? Using the original unstandardized data, compute the means of these variables in each cluster.

This is what I did:

resources <- read.csv("https://raw.githubusercontent.com/umbertomig/intro-prob-stat-FGV/master/datasets/resources.csv")

#subset
resources.subset <- subset(resources, select = c("cty_name", "year", "regime", "oil", "logGDPcp", "illit"))

#removing missing values
resources1 <- na.omit(resources.subset)

#scaling
scaled.resources <- scale(resources1)
#mean of zero
colMeans(scaled.resources) 
#standard deviation of 1
apply(scaled.resources, 2, sd)

#fitting into two clusters
cluster2 <- kmeans(resources.scaled, centers = 2)

#how many observations are assigned to each cluster?
nrow(resources.scaled)
table(cluster2$cluster)

#means of the variables
cluster2$centers
g1 <- resources1[cluster2$cluster == 1, ]
colMeans(g1)
g2 <- resources1[cluster2$cluster == 2, ]
colMeans(g2)

But I get this error:

Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric

How can I solve this?


Solution

  • One of the columns, cty_name, is not numeric (it is character), as str() shows:

    str(resources1)
    #'data.frame':  417 obs. of  6 variables:
    # $ cty_name: chr  "United Arab Emirates" "Argentina" "Argentina" "Argentina" ...
    # $ year    : int  1975 1970 1975 1980 1985 1990 1995 1997 1970 1970 ...
    # $ regime  : num  -7 -9 6 -9 8 8 8 8 -7 -2 ...
    # $ oil     : num  65.9386 0.0241 0.0279 0.361 0.6939 ...
    # $ logGDPcp: num  9.71 7.64 8.07 8.53 8.58 ...
    # $ illit   : num  40.2 7.3 6.5 6.1 5 4.3 3.7 3.5 80.1 89.1 ...
    # - attr(*, "na.action")= 'omit' Named int [1:4113] 1 2 3 4 5 6 7 8 9 10 ...
    #  ..- attr(*, "names")= chr [1:4113] "1" "2" "3" "4" ...
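
    Why a single character column is enough to trigger the error can be checked on a small example. The sketch below uses a made-up data frame (`df`, `name`, `x`, and `y` are hypothetical and not part of the question's data) and relies on scale() converting its input to a matrix before computing column means:

    ```r
    # Toy data frame (made-up values) with one character column, like cty_name
    df <- data.frame(name = c("a", "b", "c"),
                     x = c(1, 2, 3),
                     y = c(10, 20, 40))

    # A matrix holds one type only, so the numeric columns become strings
    m <- as.matrix(df)
    is.character(m)

    # scale() on the full data frame fails; capture the message instead of stopping
    res <- tryCatch(scale(df), error = function(e) conditionMessage(e))
    res

    # Dropping the character column first lets scale() run
    ok <- scale(df[sapply(df, is.numeric)])
    colMeans(ok)   # column means of the scaled data, zero up to rounding
    ```

    This suggests the error in the question is most likely raised by the scale() call itself rather than by the later colMeans(g1) lines.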
    

    So it may be better to scale only the numeric columns:

    i1 <- sapply(resources1, is.numeric)
    scaled.resources <- scale(resources1[i1])
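
    Two details in the question may still need attention after this change. The selection `i1` keeps `year`, which the assignment does not list, and the kmeans() call refers to `resources.scaled` while the scaled object is named `scaled.resources`. A minimal sketch of the whole task, restricted to the four named variables, could look like the following. The helper `cluster_summary`, its `seed` argument, and the `toy` data frame with its values are hypothetical and serve only to make the example run on its own:

    ```r
    # Variables named in the assignment
    vars <- c("regime", "oil", "logGDPcp", "illit")

    # Hypothetical helper: cluster on standardized data,
    # then summarise each cluster on the original scale
    cluster_summary <- function(df, vars, k = 2, seed = 1) {
      complete <- na.omit(df[, vars])     # drop rows with any missing value
      scaled <- scale(complete)           # mean 0, sd 1 for each column
      set.seed(seed)                      # k-means starts from random centers
      fit <- kmeans(scaled, centers = k, nstart = 25)
      list(
        sizes = table(fit$cluster),       # observations per cluster
        means = aggregate(complete,
                          by = list(cluster = fit$cluster),
                          FUN = mean)     # means of the unstandardized data
      )
    }

    # Toy data with made-up values, only to show the call
    toy <- data.frame(
      cty_name = paste0("country", 1:8),
      regime   = c(-7, -9, -8, -6, 8, 9, 7, NA),
      oil      = c(60, 55, 70, 65, 0.1, 0.3, 0.2, 0.5),
      logGDPcp = c(9.5, 9.7, 9.6, 9.4, 7.8, 8.0, 7.9, 8.1),
      illit    = c(40, 45, 42, 38, 5, 6, 4, 7)
    )

    out <- cluster_summary(toy, vars)
    out$sizes
    out$means
    ```

    On the question's data the call would be `cluster_summary(resources, vars)`. Cluster labels 1 and 2 are arbitrary and can swap between runs, which is why the sketch fixes a seed.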