
Python sequence cluster exercise


I am working through an exercise in my textbook and am implementing the code in Python to practice dynamic programming. I feel like I am right on the edge of figuring it out, but after many hours, I come here for help.

Basically, my code goes through a list of values x and, given a k, breaks that list into k clusters so as to minimize the largest per-cluster sum of squared errors (SSE).

The code builds a table, calculating the SSE for 1 cluster, 2 clusters, ..., k clusters over each prefix list[0:1], list[0:2], list[0:3], ..., list[0:n], and choosing the minimum SSE for that particular entry in the table.

For example: given x = [7,6,9,15,18,17,30,28,29] and k=3, we would return the clusters (7,6,9)(15,18,17)(30,28,29), with per-cluster SSEs of (4.666)(4.666)(2). So the max SSE for that clustering on that list would be 4.666.
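Those per-cluster SSE values can be checked with a small helper (this sse function is just for illustration, not part of my program):

```python
def sse(cluster):
    """Sum of squared differences from the cluster mean."""
    m = sum(cluster) / len(cluster)
    return sum((v - m) ** 2 for v in cluster)

print(sse([7, 6, 9]))     # ≈ 4.667
print(sse([15, 18, 17]))  # ≈ 4.667
print(sse([30, 28, 29]))  # 2.0
```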

Now when I try it on my second list, x = [52, 101, 103, 101, 6, 5, 7], I should get the clustering (52)(101, 103, 101)(6, 5, 7), which should give (0)(2.666)(2), or a max of 2.666, but I am not getting that. I believe the error lives in f(s, j_down, t), in the second return statement and in how I increment s and t. Hopefully I have not made a silly mistake!

Any help is much appreciated, thank you.

def mean(numbers):
    return float(sum(numbers)) / max(len(numbers), 1)

def sum_square(x):
    if isinstance(x, (int,)):
        return 0
    w = 0
    for i in x:
        w += (i - mean(x))**2
    return w

def f(s, j_down, t):
    if not r[s][j_down] and r[s][j_down] != 0:
        return sum_square(x[:t - s])

    return max(r[s][j_down], sum_square(x[:t-s]))

def get_min_f_and_s(j_down, t):
    """ range s from 1 to t-1 and set s to minimize f(s)
    """
    items = [(s, f(s, j_down, t)) for s in range(t)]
    s, min_f = min(items, key=lambda x:x[1])
    return s, min_f

def seq_out(n,k):
    for j in range(k):
        if j == 0:
            for t in range(n):
                r[t][j] = sum_square(x[:t+1])

                c[t][j] = x[:t+1]
        else:
            for t in range(1, n):
                s, min_f = get_min_f_and_s(j - 1, t)
                r[t][j] = min_f
                c[t][j] = [c[s][j - 1]] + x[s+1:t+1]

    print('the max SSE is: {}'.format(r[-1][-1]))
    print('the cluster centers are: {}'.format(c[-1][-1]))

#x = [7,6,9,15,18,17,30,28,29]    
x = [52, 101, 103, 101, 6, 5, 7]
k = 3
n = len(x)

r = [[[] for _ in range(k)] for _ in range(n)]
c = [[[] for _ in range(k)] for _ in range(n)]

print(seq_out(n,k))
print(r)
print(c)

Edit: Question layout

Given a sequence X = [x_1, x_2, ..., x_n] and an integer k > 1, partition X into clusters C_1, ..., C_k of sizes n_1, ..., n_k so that the maximum per-cluster sum of squared errors is minimized.


Solution

  • I am unable to trace how you think your code should work, and therefore I cannot tell you what mistake you are making. Also, since you're trying to learn, I'll offer you the opportunity to think through how to do it, rather than just hand you code that appeared by magic.

    Assuming that you want a bottom-up approach, one option is to fill in the following table (which is better done as an array of arrays, but I'll present it as a dictionary of dictionaries to make it easier to read):

    best_cluster_by_pos_by_clusters = {
        0: {
            1: {'start': 0, 'error': 0.0, 'max_error': 0.0}
            },
        1: {
            1: {'start': 0, 'error': 1200.5, 'max_error': 1200.5},
            2: {'start': 1, 'error': 0.0, 'max_error': 0.0},
            }, 
        2: {
            1: {'start': 0, 'error': 1668.6666666666667, 'max_error': 1668.6666666666667},
            2: {'start': 1, 'error': 2.0, 'max_error': 2.0},
            3: {'start': 2, 'error': 0.0, 'max_error': 0.0},
            },
        3: {
            1: {'start': 0, 'error': 1852.75, 'max_error': 1852.75},
            2: {'start': 1, 'error': 2.666666666666667, 'max_error': 2.666666666666667},
            3: {'start': 3, 'error': 0.0, 'max_error': 2.0},
            },
        4: {
            1: {'start': 0, 'error': 7397.2, 'max_error': 7397.2},
            2: {'start': 4, 'error': 0.0, 'max_error': 1852.75},
            3: {'start': 4, 'error': 0.0, 'max_error': 2.666666666666667},
            },
        5: {
            1: {'start': 0, 'error': 11205.333333333334, 'max_error': 11205.333333333334},
            2: {'start': 4, 'error': 0.5, 'max_error': 1852.75},
            3: {'start': 4, 'error': 0.5, 'max_error': 2.666666666666667},
            },
        6: {
            1: {'start': 0, 'error': 13735.714285714286, 'max_error': 13735.714285714286},
            2: {'start': 4, 'error': 2.0, 'max_error': 1852.75},
            3: {'start': 4, 'error': 2.0, 'max_error': 2.666666666666667},
            },
    }
    

    Here is how to interpret that table.

    The fact that best_cluster_by_pos_by_clusters[6][3] is {'start': 4, 'error': 2.0, 'max_error': 2.666666666666667} means that in the optimal division of the numbers at positions 0-6 into three clusters, the third cluster holds the numbers at positions 4, 5, and 6. That cluster has a squared error of 2.0, and the maximum error across all clusters is 2.666666666666667. This gives you the cluster [6, 5, 7]. To find the rest, we go to best_cluster_by_pos_by_clusters[3][2] (i.e. the best division into 2 clusters ending at position 3), where we similarly find the cluster [101, 103, 101]. We're then left looking at best_cluster_by_pos_by_clusters[0][1] (the best 1 cluster ending at position 0), which gives us the last cluster, [52].
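    That walk backwards through the table can be sketched as follows (assuming the same dictionary layout as above; extract_clusters is an illustrative name, and this is a sketch of the extraction step, not the full solution):

    ```python
    def extract_clusters(x, best, k):
        """Walk the table backwards from the last position, peeling off
        the final cluster at each step."""
        clusters = []
        pos = len(x) - 1
        for j in range(k, 0, -1):
            start = best[pos][j]['start']     # where the j-th cluster begins
            clusters.append(x[start:pos + 1])
            pos = start - 1                   # the j-1 clusters end just before it
        clusters.reverse()
        return clusters
    ```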

    So figure out how to write code to fill in that table, and then the code to extract the answer from that table, and you'll have a bottom-up dynamic programming solution.

    As for filling it in, as an example, to fill in best_cluster_by_pos_by_clusters[3][2] what I had to do was look at best_cluster_by_pos_by_clusters[i][1] for i = 0, 1, 2, to see every possible split between the previous clusters and the current one.
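    To make that concrete, here is one way the fill step for a single cell might look (fill_cell and sum_square are illustrative names, not part of the table above; this handles only the j >= 2 case and assumes the earlier rows of best are already filled in):

    ```python
    def sum_square(cluster):
        """Sum of squared differences from the cluster mean."""
        m = sum(cluster) / len(cluster)
        return sum((v - m) ** 2 for v in cluster)

    def fill_cell(x, best, pos, j):
        """Best split of x[0..pos] into j clusters: try every start for the
        j-th cluster, combining its SSE with the best j-1 clustering that
        ends at start-1, and keep the candidate with the smallest max error."""
        candidates = []
        for start in range(j - 1, pos + 1):  # need j-1 items before start
            err = sum_square(x[start:pos + 1])
            prev_max = best[start - 1][j - 1]['max_error']
            candidates.append({'start': start, 'error': err,
                               'max_error': max(err, prev_max)})
        return min(candidates, key=lambda cand: cand['max_error'])
    ```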