I need to do this for my assignment:
We focus on the following subset of the variables: regime, oil, logGDPcp, and illit. Remove observations that have missing values in any of these variables. Using the scale() function, scale these variables so that each variable has a mean of zero and a standard deviation of one. Fit the k-means clustering algorithm with two clusters. How many observations are assigned to each cluster? Using the original unstandardized data, compute the means of these variables in each cluster.
This is what I did:
resources <- read.csv("https://raw.githubusercontent.com/umbertomig/intro-prob-stat-FGV/master/datasets/resources.csv")
#subset
resources.subset <- subset(resources, select = c("cty_name", "year", "regime", "oil", "logGDPcp", "illit"))
#removing missing values
resources1 <- na.omit(resources.subset)
#scaling
scaled.resources <- scale(resources1)
#mean of zero
colMeans(scaled.resources)
#standard deviation of 1
apply(scaled.resources, 2, sd)
#fitting into two clusters
cluster2 <- kmeans(scaled.resources, centers = 2)
#how many observations are assigned to each cluster?
nrow(scaled.resources)
table(cluster2$cluster)
#means of the variables
cluster2$centers
g1 <- resources1[cluster2$cluster == 1, ]
colMeans(g1)
g2 <- resources1[cluster2$cluster == 2, ]
colMeans(g2)
But I get this error: Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
How can I solve this?
One column, cty_name, is not numeric (it is character), and scale() calls colMeans() internally, which is where the error comes from:
str(resources1)
#'data.frame': 417 obs. of 6 variables:
# $ cty_name: chr "United Arab Emirates" "Argentina" "Argentina" "Argentina" ...
# $ year : int 1975 1970 1975 1980 1985 1990 1995 1997 1970 1970 ...
# $ regime : num -7 -9 6 -9 8 8 8 8 -7 -2 ...
# $ oil : num 65.9386 0.0241 0.0279 0.361 0.6939 ...
# $ logGDPcp: num 9.71 7.64 8.07 8.53 8.58 ...
# $ illit : num 40.2 7.3 6.5 6.1 5 4.3 3.7 3.5 80.1 89.1 ...
# - attr(*, "na.action")= 'omit' Named int [1:4113] 1 2 3 4 5 6 7 8 9 10 ...
#  ..- attr(*, "names")= chr [1:4113] "1" "2" "3" "4" ...
So it may be better to scale only the numeric columns:
i1 <- sapply(resources1, is.numeric)
scaled.resources <- scale(resources1[i1])
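Note that is.numeric also picks up year, and that colMeans(g1) will fail for the same reason as scale() did, because g1 still contains cty_name. Since the assignment only asks about regime, oil, logGDPcp, and illit, one option is to select those four columns by name throughout. Here is a rough sketch of the whole pipeline on a made-up data frame: the toy object and its values are invented purely for illustration, and vars and cluster.means are names I introduced. With the real data you would replace toy with resources1.

```r
# Toy stand-in for resources1: hypothetical values, same column layout
set.seed(1)
toy <- data.frame(
  cty_name = rep(c("A", "B", "C", "D"), each = 5),
  year     = rep(seq(1970, 1990, by = 5), times = 4),
  regime   = c(rnorm(10, -7, 1), rnorm(10, 8, 1)),
  oil      = c(rnorm(10, 40, 5), rnorm(10, 1, 0.5)),
  logGDPcp = c(rnorm(10, 9, 0.3), rnorm(10, 7, 0.3)),
  illit    = c(rnorm(10, 50, 5), rnorm(10, 10, 3))
)

# The four variables named in the assignment
vars <- c("regime", "oil", "logGDPcp", "illit")

# Scale only those columns, so cty_name and year never reach scale()
scaled.toy <- scale(toy[vars])

# k-means with two clusters; nstart = 10 reruns from several random starts
cluster2 <- kmeans(scaled.toy, centers = 2, nstart = 10)

# Number of observations assigned to each cluster
table(cluster2$cluster)

# Cluster means computed on the original, unstandardized variables
cluster.means <- aggregate(toy[vars],
                           by = list(cluster = cluster2$cluster),
                           FUN = mean)
cluster.means
```

The cluster labels (1 and 2) are arbitrary and can swap between runs, so compare clusters by their means rather than by their numbers.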