
How to create groups of like sounding names in R?


I'd like to create a grouping variable based on how similar a set of names is. I started by using the stringdist package to compute a distance measure, but I'm not sure how to turn that output into a group-by variable. I've looked at hclust, but clustering functions seem to require knowing in advance how many groups you want, and I don't know that. My starting code is below:

library(stringdist)

name_list <- c("Mary", "Mery", "Mary", "Joe", "Jo", "Joey", "Bob", "Beb", "Paul")

name_dist <- stringdistmatrix(name_list)
name_dist
name_dist2 <- stringdistmatrix(name_list, method="soundex")
name_dist2

I would like to end up with a data frame with two columns that look like this:

name = c("Mary", "Mery", "Mary", "Joe", "Jo", "Joey", "Bob", "Beb", "Paul")

name_group = c(1, 1, 1, 2, 2, 2, 3, 3, 4)

The groups might differ slightly depending on which distance measure I use (I've tried two above), but I would choose one or the other to run.

Basically, how do I get from the distance matrix to a group variable without knowing the number of clusters I'd like?


Solution

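  • You could use hierarchical clustering. The tree can be cut either into a fixed number of groups or at a height (a distance threshold), and the height cut does not require knowing the number of groups: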

    # load the package
    library(stringdist)
    
    # cut-off settings: a fixed number of groups, or a height (distance threshold)
    num.class <- 5
    num.height <- 0.5
    
    # define names
    n <- c("Mary", "Mery", "Mari", "Joe",
           "Jo", "Joey", "Bob", "Beb", "Paul")
    
    # calculate pairwise distances (returns a 'dist' object)
    d <- stringdistmatrix(n, method = "soundex")
    
    # hierarchical clustering (complete linkage by default)
    h <- hclust(d)
    
    # cut the tree into a fixed number of groups
    m <- cutree(h, k = num.class)
    
    # cut the tree at a height; no group count needed
    p <- cutree(h, h = num.height)
    
    # build the resulting data frame
    # (group.prob holds the height-based groups)
    df <- data.frame(names = n,
                     group.class = m,
                     group.prob = p)
    

    It produces:

    df
       names group.class group.prob
    1  Mary         1          1
    2  Mery         1          1
    3  Mari         1          1
    4   Joe         2          2
    5    Jo         2          2
    6  Joey         2          2
    7   Bob         3          3
    8   Beb         4          3
    9  Paul         5          4
    
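Since the question asks for groups without a known group count, the height-based cut is the relevant part. A minimal sketch of it wrapped into a function might look like the following. The name `group_names` and its `cut_height` argument are hypothetical and not part of stringdist. With the default complete linkage, cutting at a height keeps every pair inside a group within that distance of each other.

```r
library(stringdist)

# Hypothetical helper (not part of stringdist): group names by cutting
# the hclust tree at a distance threshold instead of a fixed group count.
group_names <- function(x, method = "soundex", cut_height = 0.5) {
  d <- stringdistmatrix(x, method = method)  # 'dist' object of pairwise distances
  tree <- hclust(d)                          # complete linkage by default
  data.frame(name = x,
             name_group = cutree(tree, h = cut_height),
             stringsAsFactors = FALSE)
}

# the names from the question
name_list <- c("Mary", "Mery", "Mary", "Joe", "Jo", "Joey", "Bob", "Beb", "Paul")
groups <- group_names(name_list)
groups$name_group
# 1 1 1 2 2 2 3 3 4
```

The Soundex distance is only ever 0 (same code) or 1 (different code), so any threshold between 0 and 1 gives the same groups. For an edit-distance method the threshold would have to be chosen to suit the data.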

    The dendrogram gives you an overview:

    plot(h, labels = n)
    
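To see where a height cut falls on the dendrogram, one option is to draw the threshold and outline the resulting groups. This is only a sketch using `abline` and `rect.hclust` from base R, with the same names and the 0.5 threshold from the code above. The variable `boxes` is introduced here for illustration.

```r
library(stringdist)

n <- c("Mary", "Mery", "Mari", "Joe", "Jo", "Joey", "Bob", "Beb", "Paul")
h <- hclust(stringdistmatrix(n, method = "soundex"))
num.height <- 0.5

plot(h, labels = n)
abline(h = num.height, lty = 2)          # dashed line at the cut height
boxes <- rect.hclust(h, h = num.height)  # outline the groups below the line
length(boxes)                            # number of groups at this height
```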


    Regards huck
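
For the Soundex case specifically, a different route that skips clustering altogether is to group on the phonetic code itself. This is a sketch assuming `phonetic()` from stringdist; the `codes` and `df2` names are illustrative only. It does not carry over to edit-distance methods, where names have no shared code.

```r
library(stringdist)

n <- c("Mary", "Mery", "Mari", "Joe", "Jo", "Joey", "Bob", "Beb", "Paul")

# one Soundex code per name
codes <- phonetic(n, method = "soundex")

# number the groups in order of first appearance
df2 <- data.frame(names = n,
                  code = codes,
                  group = match(codes, unique(codes)))
df2
```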