Search code examples
objective-cstringalgorithmpseudocode

Algorithm to group consecutive words minimizing length per group


From an input of space-delimited words, how to concatenate consecutive words so that:

  • each group has a minimum length L (spaces don't count)
  • longest group length is minimal (spaces don't count)

Example input:

would a cat eat a mouse

Example minimum length:

L = 5

Naive algorithm that solves the first condition but not the second one:

while length of a group is less than L, concatenate next word to group
if last group is shorter than L, concatenate last two groups together

This naive algorithm produces:

  • group 1: would
  • group 2: acateat
  • group 3: amouse
  • longest group length: 7

Second condition is not solved because a better solution would be:

  • group 1: woulda
  • group 2: cateat
  • group 3: amouse
  • longest group length: 6

Which algorithm would solve the second condition (minimal longest group) with relatively fast execution as a program? (by fast, I'd like to avoid testing all possible combinations)

I know C, ObjC, Swift, Javascript, Python, but pseudocode is fine.


Solution

  • This can be done with dynamic programming approach. Let's count a function F(i) - the minimum length of the longest group among correct divisions of the first i words into groups.

    F(0) = 0
    F(i) = Min(Max(F(j), totalLen(j+1, i))), for j in [0..i-1] 
    

    Where

    totalLen(i, j) = total length of words from i to j, if the length is at least L 
    totalLen(i, j) = MAX, if total length is less than L 
    

    The answer is the value of F(n). To get the groups themselves we can save the indices of the best j for every i.

    There is a implementation from the scratch in c++:

    const vector<string> words = {"would", "a", "cat", "eat", "a", "mouse"};
    const int L = 5;
    int n = words.size();
    
    vector<int> prefixLen = countPrefixLen(words);
    
    vector<int> f(n+1);
    vector<int> best(n+1, -1);
    int maxL = prefixLen[n];
    f[0] = 0;
    
    for (int i = 1; i <= n; ++i) {
        f[i] = maxL; 
        for (int j = 0; j < i; ++j) {
            int totalLen = prefixLen[i] - prefixLen[j];
            if (totalLen >= L) {
                int maxLen = max(f[j], totalLen);
                if (f[i] > maxLen) {
                    f[i] = maxLen;
                    best[i] = j;
                }
            }
        } 
    }
    
    output(f[n], prev, words);  
    

    Preprocessing and output details:

    vector<int> countPrefixLen(const vector<string>& words) {
        int n = words.size();
        vector<int> prefixLen(n+1);
        for (int i = 1; i <= n; ++i) {
            prefixLen[i] = prefixLen[i-1] + words[i-1].length();
        } 
        return prefixLen;
    }
    
    void output(int answer, const vector<int>& best, const vector<string>& words) {
        cout << answer << endl;
    
        int j = best.size()-1;
        vector<int> restoreIndex(1, j);
        while (j > 0) {
            int i = best[j];
            restoreIndex.push_back(i);
            j = i;
        }
        reverse(restoreIndex.begin(), restoreIndex.end());
        for (int i = 0; i+1 < restoreIndex.size(); ++i) {
            for (int j = restoreIndex[i]; j < restoreIndex[i+1]; ++j) {
                cout << words[j] << ' ';
            }
            cout << endl;
        }
    }
    

    Output:

    6
    would a 
    cat eat 
    a mouse 
    

    Runnable: https://ideone.com/AaV5C8

    Further improvement

    The complexity of this algorithm is O(N^2). If it is too slow for your data I can suggest a simple optimization:

    Let's inverse the inner loop. First, this allows to get rid of the prefixLen array and it's preprocessing, because now we add words one by one to the group (actually, we could get rid of this preprocessing in the initial version, but at the expense of simplicity). What is more important we can break our loop when totalLen would be not less than already computed f[i] because further iterations will never lead to an improvement. The code of the inner loop could be changed to:

    int totalLen = 0;
    for (int j = i-1; j >= 0; --j) {
        totalLen += words[j].length();
        if (totalLen >= L) {
            int maxLen = max(f[j], totalLen);
            if (f[i] > maxLen) {
                f[i] = maxLen;
                best[i] = j;
            }
        }
        if (totalLen >= f[i]) break;
    } 
    

    This can drastically improve the performance for not very big values of L.