Scala String Similarity

I have a Scala code that computes similarity between a set of strings and give all the unique strings.

val filtered = z.reverse.foldLeft((List.empty[String],z.reverse)) {
    case ((acc, zt), zz) =>
      if (zt.tail.exists(tt => similarity(tt, zz) < threshold)) acc 
      else zz :: acc, zt.tail
  }._1

I'll try to explain what is going on here :

This uses a fold over the reversed input data, starting from the empty String (to accumulate results) and the (reverse of the) remaining input data (to compare against - I labeled it zt for "z-tail").

The fold then cycles through the data, checking each entry against the tail of the remaining data (so it doesn't get compared to itself or any earlier entry)

If there is a match, just the existing accumulator (labelled acc) will be allowed through, otherwise, add the current entry (zz) to the accumulator. This updated accumulator is paired with the tail of the "remaining" Strings (zt.tail), to ensure a reducing set to compare against.

Finally, we end up with a pair of lists: the required remaining Strings, and an empty list (no Strings left to compare against), so we take the first of these as our result.

The problem is like in first iteration, if 1st, 4th and 8th strings are similar, I am getting only the 1st string. Instead of it, I should get a set of (1st,4th,8th), then if 2nd,5th,14th and 21st strings are similar, I should get a set of (2nd,5th,14th,21st).

Solution

If I understand you correctly - you want the result to be of type List[List[String]] and not the List[String] you are getting now - where each item is a list of similar Strings (right?).

If so - I can't see a trivial change to your implementation that would achieve this, as the similar values are lost (when you enter the if(true) branch and just return the acc - you skip an item and you'll never "see" it again).

Two possible solutions I can think of:

Based on your idea, but using a 3-Tuple of the form (acc, zt, scanned) as the foldLeft result type, where the added scanned is the list of already-scanned items. This way we can refer back to them when we find an element that doesn't have preceeding similar elements:

val filtered = z.reverse.foldLeft((List.empty[List[String]],z.reverse,List.empty[String])) {
  case ((acc, zt, scanned), zz) =>
    val hasSimilarPreceeding = zt.tail.exists { tt => similarity(tt, zz) < threshold }
    val similarFollowing = scanned.collect { case tt if similarity(tt, zz) < threshold => tt }
    (if (hasSimilarPreceeding) acc else (zz :: similarFollowing) :: acc, zt.tail, zz :: scanned)
}._1

A probably-slower but much simpler solution would be to just groupBy the group of similar strings:

val alternative = z.groupBy(s => z.collect { 
  case other if similarity(s, other) < threshold => other 
}.toSet ).values.toList

All of this assumes that the function:

f(a: String, b: String): Boolean = similarity(a, b) < threshold

Is commutative and transitive, i.e.:

f(a, b) && f(a. c) means that f(b, c)
f(a, b) if and only if f(b, a)

To test both implementations I used:

// strings are similar if they start with the same character
def similarity(s1: String, s2: String) = if (s1.head == s2.head) 0 else 100

val threshold = 1
val z = List("aa", "ab", "c", "a", "e", "fa", "fb")

And both options produce the same results:

List(List(aa, ab, a), List(c), List(e), List(fa, fb))