Search code examples
algorithmstringmathcomputer-sciencecombinatorics

Combinatorics : Grouping Characters Challenges


I was working on some grouping problems at my work. So we have a bunch of characters, here I have taken a i d s.

  1. What are the ways we can group the elements ? Let us say we have 4 characters a i d s. Valid groups (preserving the order) would be:

    a i d s
    a i ds
    a id s
    ai d s
    ai ds
    a ids
    aid s
    aids

How do you enumerate all the groups? How many combinations are there for any n letters?

  1. Special Cases

    • What if the case made a difference like Ai sd and ai sd are two groups?

    • How much time would it take you to enumerate all the cases? What would be the time difference between finding the 4 letters and 5 letters case?

    • If you take the "blank space" as a character. After all the enumeration, how many characters would you have written?

    • If you define a transformation from a word to another word in terms of distance. Say "ai ds" and "a i ds" has 1 distance because you should moved the letter "i" one step. Could you find the words at n distance in either side of any word?

Edit:

"aids" is a single word.
"a ids" "aid s" are two words which are at 1 distance from the original word "aids".
"a id s" is a word which is two distance from the original word "aids".
"a i d s" is a word which is three distance from the word.

This problem seems to be gold mine.

Bonus: What is the smallest distance between two words. Like "aids" is one distance from "a ids" and also two distance. Is there a "midpoint" word from where you can reach any other word in the enumeration with least distance? How many paths are there from a word to another?


Solution

  • Computing the number of combinations is trivial. Basically, you have a character between each two letters. It might be "epsilon" (empty), or space. So, for n letters you have n-1 such separator characters. As each character can only have two values, that's the same as a binary number of n-1 digits. So you can have 2 to the power of n-1 combinations.

    aids = 4 characters (n = 4)
    n - 1 = 3
    2 ** (n-1) = 8 combinations
    

    Now, for the special cases. If each character can be either lowercase or uppercase, that's 2 to the power of n variations for the characters (as long as they are letters). Now, EACH combination above has 2n variations based on capitalization, so the end result is (2(n-1)) * (2**n), which is equal to 2 ** (2*n -1).

    For each aditional letter you either double or quadruple (depending on the capitalization thing) the time taken to enumerate them, as can be easily understood from the formula.

    The total number of characters is a bit more difficult, but suffice to notice that each "space" is "epsilon" half the time. So we have (2 ** (n-1)) * n (letters) + ((2 ** (n-1)) * (n-1)) / 2 (spaces). In the example:

    (2 ** 3) * 4 = 32 characters
    (2 ** 3) * 3 / 2 = 12 spaces
    Total = 44 characters
    

    Finally, the distance problem is related to the Levenshtein distance. I thought about using a Burkhard-Keller tree, but I see now this is not necessary at all, as the problem is rather simpler.

    First, the minimum number of inserts/deletions/changes necessary to make one string equal to another is called the Levenshtein distance. This is directly applicable to the problem: you add spaces, delete spaces, change lowercase/uppercase as needed. Usually, this is best solved with Dynamic Programming, which can be generally thought of as keeping data about the solution to small parts of the problem, which then get reused computing other parts/bigger parts.

    But given the particular restrictions of our problem, we can just compare the "binary" numbers representing spaces/epsilon.

    Say a function f(word) will return the binary number representing spaces in that word. For example, it will return "010" for "ai ds" and "111" for "a i d s". The number of changes between each combination is given by XORing the results of f (010 xor 111 = 101), and then counting the number of bits equal to 1. Let's write a couple of functions in Scala for that:

    def spacesToBinary(s: String) = {
      (s zip s.drop(1)) // transform "ai ds" into ((a,i), (i, ), ( ,d), (d, s))
      .foldLeft(0) { (binary, pair) => pair match {
          case (' ', _) => binary            // A space is always followed by a letter,
                                             // so we ignore it
          case (_, ' ') => (binary << 1) | 1 // If a letter is followed by a space, bit = 1
          case _ => binary << 1              // If not, bit = 0
      }}
    }
    
    def numberOfOnes(d: Int) = {
      var bits = 0
      var n = d
      while(n != 0) {
        if ((n & 1) == 1)
          bits += 1
        n >>= 1
      }
      bits
    } 
    
    def spaceDistance(s1: String, s2: String) = 
      numberOfOnes(spacesToBinary(s1) ^ spacesToBinary(s2))
    

    Now, comparing lowercase/uppercase is quite easy, once we discard spaces:

    def caseDistance(s1: String, s2: String) = {
      val ns1 = s1 filter (_ != ' ')
      val ns2 = s2 filter (_ != ' ')
      (ns1 zip ns2).foldLeft(0){(acc, pair) => acc + (if (pair._1 == pair._2) 0 else 1)}
    }
    

    So, the Levenshtein distance is:

    def levenshtein(s1: String, s2: String) = spaceDistance(s1, s2) + caseDistance(s1, s2)
    

    We also know the following properties about the Levenshtein distance. Say d(x,y) is the Levenshtein distance between x and y. Then we know:

    d(x, y) = 0 if, and only if, x = y
    d(x, y) = d(y, x)
    d(x, y) + d(y, z) >= d(x, z)
    

    The last criteria i known as the triange inequality. Simply put, the path from x to z is at least as small as the path from x to y plus the path from y to z (think a triangle with vertices x, y and z).

    Ok, so let's think about the bonus questions.

    How many paths there are between two words? Well, if the Levenshtein distance is n, that means you have "n" unique modifications, a1, a2, ..., an. For each different ordering of these modifications you have a path. The number of permutations of "n" elements is the factorial of "n" (or n!):

    def factorial(n: Int): Int = n match {
      case 0 => 1
      case _ => n * factorial(n-1)
    }
    
    2! = 2
    3! = 6
    4! = 24
    5! = 120
    etc
    

    Is there a "central" combination? Actually, no. If we go back and think of combinations as pairs of binary numbers representing the spaces/no-spaces and lower/uppercases, then it should be obvious that you can simply invert the bits to produce a new combination whose distance to the one chosen is the maximum possible.

    Or, in other words, for every combination of n letters, there is one, and only one, corresponding combination, so that the Levenshtein distance between the two combinations is 2*n - 1, the maximum possible distance between any two combinations.

    I see I forgot to compute all combinations whose (minimum) distance to s is n. Well, we know each combination can be represented as two binary numbers, representing the spaces and the capitalization of each letter, the first being n-1 digits long, and the second n digits long.

    We also know that we can simply invert any of these "digits" to get a difference. So, if we get one big binary number 2*n - 1 digits long, and we enumerate all its digits, the combinations at a minimum distance d from it are given by the combination of the 2*n-1 digits in groups of size "d" with no repetitions. For N=2*n-1, the number of such combination is N!/((N-d)! * d!).

    For example, for distance 2 and "aids", n=4, d=2, N=2*4-1=7, and the number of combinations is 7!/(5!*2!) = 7 * 6 / 2 = 21.

    We can compose the combinations this way:

    def computeCombinations(n: Int, d: Int): List[List[Int]] = {
      def gen(d: Int, l: List[Int]): List[List[Int]] = d match {
        case 0 => List(List())
        case _ => for(el <- l;
                      sl <- gen(d-1, l.dropWhile(_ != el).tail))
                  yield el :: sl
      }
    
      gen(d, (0 until n).toList)
    }
    

    This will return lists of lists of letter/spaces to change. We indicate which letter or space needs changing through the number of the bit we want to toggle. To simplify things, we assume the binary number for the capitalization and the binary number for the space/no-space are concatenated in a single binary number.

    Next, we have to find a way to produce the variations from this information. The upper/lowercase is simple, assuming we receive the word without spaces:

    def toggleCharCase(c: Char) = (c ^ ('A' ^ 'a')).toChar
    def toggleCases(s: String, bits: List[Int]) = (
      s.zipWithIndex
      map (t => if (Set(bits: _*) contains t._2) (toggleCharCase(t._1), t._2) else t)
      map (_._1)
    )
    

    Toggling spaces is more difficult. We'll use spacesToBinary, defined above, convert that into a list of bit numbers which are set, toggle the requested bits and return that. Afterwards, we use a different function to insert the spaces in the proper places:

    def toggleSpaces(s: String, bits: List[Int]): List[Int] = {
      val before = spacesToBinary(s)
      val beforeBits = Set() ++ 
                       (for(i <- 0 to 30; // Assuming 32 bits, 1 for sign
                            if (scala.Math.pow(2, i).toInt & before) != 0)
                        yield i)
      val afterBits = (beforeBits union Set(bits: _*)) diff 
                      (beforeBits intersect Set(bits : _*))
      afterBits.toList
    }
    
    def addSpaces(s: String, bits: List[Int]): String = (
      s.filter(_ != ' ').zipWithIndex 
      map (t => t._1.toString + (if (bits contains t._2) " " else ""))
      mkString
    )
    

    Now we have to compute the space difference, remove the spaces, toggle cases, and then add the spaces back. Let's see:

    def changeCombination(s: String, n: Int, bits: List[Int]) = {
      // Split the binary number with all changes into case changes and space changes
      val spaceBits = bits filter (_ >= n) map (_ - n)
      val caseBits = bits filter (_ < n)
    
      // Compute the set of spaces after changing them
      val newSpaces = toggleSpaces(s, spaceBits)
    
      // Remove spaces and toggle case
      val noSpaces = s filter (_ != ' ')
      val caseChanged = toggleCases(noSpaces, caseBits).mkString
    
      // Now add the spaces as computed before
      val spacesAdded = addSpaces(caseChanged, newSpaces).mkString
      spacesAdded
    }
    

    Finally, we compute all combinations, and then produce a modified string for each one:

    def makeCombinations(s: String, d: Int) = {
      val n = (s filter (_ != ' ')).length
      for(combination <- computeCombinations(n*2-1, d))
      yield changeCombination(s, n, combination)
    }
    

    This code has all been tested on Scala 2.8 (except for some renaming I just made). Here is the result of a run:

    scala> makeCombinations("ai ds", 2) foreach println
    AI ds
    Ai Ds
    Ai dS
    A i ds
    Aids
    Ai d s
    aI Ds
    aI dS
    a I ds
    aIds
    aI d s
    ai DS
    a i Ds
    aiDs
    ai D s
    a i dS
    aidS
    ai d S
    a ids
    a i d s
    aid s