Search code examples
perlsortinglevenshtein-distance

perl custom sort by string similarity clustering


In Perl, I would like to sort a collection of different length strings in a way that automatically lumps together similar strings.

Intuitively, I imagine I need some distance measure for each pair and then a clustering routine that groups by the distance.

My number of strings is always small and they are short, see an example below.

Is there a simple approach that will do what I need in sort_magic_here?

#!/usr/bin/perl
use strict;

my @list =
  ("JK_HJ_Lancaster", "SY4_TS_HJ_1000ng",
   "NB_E_200cc_caHJ_Rep1", "HB_E_100cc_caHJ_Rep1",
   "HB_E_200cc_caHJ_Rep1", "Normal_Lancaster",
   "NB15_OP_HJ_1000ng","Zoey_HJ_Slough",
   "NB_E_100cc_caHJ_Rep1","Normal_Slough",
   "JK_caHJ_Slough","Zoey_HJ_Lancaster");

print "# Straight sort\n";
foreach my $elem (sort @list) {
  print "$elem\n";
}

print "# Sort grouped by string distance\n";
foreach my $elem (sort { sort_magic_here() }  @list) {
  print "$elem\n";
}

Solution

  • Custom sorts take two inputs, perform a 'comparison' and respond with -1, 0 or 1 depending on whether they are before, after or equal.

    Sorting is designed for making a positional order, not really for 'grouping stuff that's vaguely similar'.

    You do have the Text::Levenshtein module which quickly lets you compute that - but you have to do something altogether more complicated because you'd need to compare each word to each other word before being able to decide ordering. But frankly, you will have the same problem with any 'similar words' sort of comparison.

    In this, you're starting to look at graph theory and grouping based on that. It's quite a complicated problem though - it's far from as trivial as 'just' sorting.

    I'd be looking at something like:

    #!/usr/bin/perl
    use strict;
    use warnings;
    
    use Text::Levenshtein qw ( distance );
    use Data::Dumper;
    
    my @list = (
        "JK_HJ_Lancaster",      "SY4_TS_HJ_1000ng",
        "NB_E_200cc_caHJ_Rep1", "HB_E_100cc_caHJ_Rep1",
        "HB_E_200cc_caHJ_Rep1", "Normal_Lancaster",
        "NB15_OP_HJ_1000ng",    "Zoey_HJ_Slough",
        "NB_E_100cc_caHJ_Rep1", "Normal_Slough",
        "JK_caHJ_Slough",       "Zoey_HJ_Lancaster"
    );
    
    my %distances;
    
    foreach my $elem (@list) {
        foreach my $compare (@list) {
            next if $elem eq $compare;
            my $distance = distance( $elem, $compare );
            $distances{$elem}{$compare} = $distance;
        }
    }
    
    print Dumper \%distances;
    
    my %seen;
    my ($cursor) = sort @list;
    
    while ($cursor) {
        print "$cursor\n";
        $seen{$cursor}++;
        my @near_words_in_order =
            sort { $distances{$cursor}{$a} <=> $distances{$cursor}{$b} }
            keys %{ $distances{$cursor} };
    
        #      print @near_words_in_order;
        last unless @near_words_in_order;
        while ( $seen{$cursor} ) {
            $cursor = shift(@near_words_in_order) // 0;
        }
    }
    

    Which gives the result:

    HB_E_100cc_caHJ_Rep1
    HB_E_200cc_caHJ_Rep1
    NB_E_200cc_caHJ_Rep1
    NB_E_100cc_caHJ_Rep1
    NB15_OP_HJ_1000ng
    SY4_TS_HJ_1000ng
    Zoey_HJ_Slough
    JK_caHJ_Slough
    Normal_Slough
    Normal_Lancaster
    JK_HJ_Lancaster
    Zoey_HJ_Lancaster
    

    Which at least approximately groups like you request. You can probably get this more efficient, because you don't need to compute all the distances which'll reduce the algorithm complexity. But you also will get different groups based on proximity and start point.