I'd like to create a grouping variable based on how similar a set of names is. I started by using the stringdist package to compute a distance measure, but I'm not sure how to turn that output into a grouping variable. I've looked at hclust, but clustering functions seem to require knowing how many groups you want in the end, and I don't know that. The code I started with is below:
library(stringdist)
name_list <- c("Mary", "Mery", "Mary", "Joe", "Jo", "Joey", "Bob", "Beb", "Paul")
name_dist <- stringdistmatrix(name_list)
name_dist
name_dist2 <- stringdistmatrix(name_list, method="soundex")
name_dist2
I would like to end up with a data frame with two columns that look like this:
name = c("Mary", "Mery", "Mary", "Joe", "Jo", "Joey", "Bob", "Beb", "Paul")
name_group = c(1, 1, 1, 2, 2, 2, 3, 3, 4)
The groups might differ slightly depending on which distance measure I use (I've suggested two above), but I would probably pick one of them and run with it.
Basically, how do I get from the distance matrix to a group variable without knowing the number of clusters I'd like?
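You could use hierarchical clustering and cut the tree either into a fixed number of groups or at a given height. The height cut does not require knowing the number of groups in advance: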
# load the package
library(stringdist)
# group selection: by number of classes or by cut height
num.class <- 5
num.height <- 0.5
# define names
n <- c("Mary", "Mery", "Mari", "Joe",
       "Jo", "Joey", "Bob", "Beb", "Paul")
# calculate pairwise distances (soundex: 0 = same code, 1 = different code)
d <- stringdistmatrix(n, method = "soundex")
# hierarchical clustering on the distance matrix
h <- hclust(d)
# cut the tree into a fixed number of classes
m <- cutree(h, k = num.class)
# cut the tree at a given height (number of groups not needed)
p <- cutree(h, h = num.height)
# build the resulting data frame
df <- data.frame(names = n,
                 group.class = m,
                 group.prob = p)
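It produces the following (note that forcing five classes splits Bob and Beb, while the height cut keeps them together):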
df
  names group.class group.prob
1  Mary           1          1
2  Mery           1          1
3  Mari           1          1
4   Joe           2          2
5    Jo           2          2
6  Joey           2          2
7   Bob           3          3
8   Beb           4          3
9  Paul           5          4
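If you mainly want the height-based grouping, the steps above could be wrapped into a small helper. This is only a sketch: `group_names` and its arguments are names I made up for illustration, not part of stringdist. The default height of 0.5 only makes sense for soundex, where distances are either 0 or 1. For other methods the threshold would need tuning, for example by looking at the dendrogram.

```r
library(stringdist)

# Hypothetical helper (not part of stringdist): cluster the names and cut
# the tree at a distance threshold instead of a fixed number of groups.
group_names <- function(x, method = "soundex", height = 0.5) {
  d <- stringdistmatrix(x, method = method)  # one argument: returns a 'dist' object
  tree <- hclust(d)                          # complete linkage by default
  data.frame(name = x,
             name_group = cutree(tree, h = height),
             stringsAsFactors = FALSE)
}

n <- c("Mary", "Mery", "Mari", "Joe",
       "Jo", "Joey", "Bob", "Beb", "Paul")
res <- group_names(n)
res$name_group
# soundex distances are 0 or 1, so any height strictly between 0 and 1
# gives: 1 1 1 2 2 2 3 3 4
```

The number of groups then falls out of the data and the threshold, rather than being fixed up front.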
And the dendrogram gives you an overview:
plot(h, labels = n)
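To see where a height cut falls, one option is to draw it onto the dendrogram. A minimal sketch, assuming the same names and soundex distance as above; `cut_height` is just an illustrative variable name, and `abline` and `rect.hclust` come with base R:

```r
library(stringdist)

n <- c("Mary", "Mery", "Mari", "Joe",
       "Jo", "Joey", "Bob", "Beb", "Paul")
h <- hclust(stringdistmatrix(n, method = "soundex"))

cut_height <- 0.5  # assumed threshold, same value as num.height above

plot(h, labels = n)
abline(h = cut_height, lty = 2)  # dashed horizontal line at the cut height
rect.hclust(h, h = cut_height)   # boxes around the groups formed by that cut
```

Every branch that the dashed line crosses becomes one group, so moving the line up or down shows how the grouping would change.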
Regards huck