python, scikit-learn, matrix-factorization

Sklearn NMF match input category order


I'm using Non-Negative Matrix Factorization (NMF) in sklearn for unsupervised category prediction (with labels for checking accuracy), but I don't have a clear mapping between the input categories and the transformed categories.

For example, if my categories are "A", "B", and "C" (n_components=3), I don't know which order the transformed categories will be in. I can manually print the data associated with each output feature to determine which input it most closely resembles, but I am looking for an automatic solution.

Is there a convenient method for this, or do I need to perform guess-and-check to see what category order maximizes accuracy (very slow for large numbers of categories)?
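To make the cost of guess-and-check concrete, here is a minimal sketch of that brute-force search. The function name `brute_force_mapping` and the toy labels are made up for illustration; `predicted` stands for the argmax of the NMF output per sample. It tries every permutation of component indices, so the work grows as n! in the number of categories.

```python
from itertools import permutations

import numpy as np


def brute_force_mapping(labels, predicted, cats):
    """Try every category -> component assignment and keep the most accurate one."""
    labels = np.asarray(labels)
    predicted = np.asarray(predicted)
    best_acc, best_map = -1.0, None
    # len(cats)! candidate orderings: fine for 3 categories, hopeless for 15
    for perm in permutations(range(len(cats))):
        mapping = dict(zip(cats, perm))
        expected = np.array([mapping[label] for label in labels])
        acc = float(np.mean(expected == predicted))
        if acc > best_acc:
            best_acc, best_map = acc, mapping
    return best_map, best_acc


# Toy data (hypothetical): true labels and the component index chosen per sample
labels = ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C']
predicted = [1, 1, 0, 0, 0, 2, 2, 1]
best_map, best_acc = brute_force_mapping(labels, predicted, ['A', 'B', 'C'])
```

With 5 categories this is only 120 permutations, but 10 categories already means about 3.6 million accuracy evaluations.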


Solution

  • I found a solution that is much faster than guess-and-check. For each known category, average the numeric category the model predicts (the argmax of the transformed output) over all known instances of that category. Ranking the categories by that average gives the approximate mapping as integer indices:

    import numpy as np

    cats = ['A', 'B', 'C', 'D', 'E']  # Example categories
    values = []
    # `train` (labels) and data['Train'] (features) are assumed to be row-aligned
    for c in cats:
        # Boolean mask selecting the samples from one category at a time
        idxs = train['Category'] == c
        # Get the average (numeric) output category for those samples
        values.append(np.mean(np.argmax(model.transform(data['Train'][idxs]), axis=1)))

    # Rounding may be unreliable, so sort (value, category) pairs in ascending order
    # and use their positions instead to guarantee no duplicates
    pairs = sorted(zip(values, cats))
    # The category mapping is {category: index}
    catmap = {cat: i for i, (_, cat) in enumerate(pairs)}
    

    Output Example:

    {'B': 0, 'A': 1, 'C': 2, 'E': 3, 'D': 4}
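As a possible alternative to ranking the averages (this is a different technique, not the method above), the mapping can be treated as an assignment problem: count how often each category lands in each component, then pick the one-to-one pairing with the most agreement using SciPy's `linear_sum_assignment` (the Hungarian algorithm). The sketch below is a rough illustration under assumptions: `match_components` is a hypothetical helper, the toy data is made up, and `predicted` is assumed to be the per-sample argmax of `model.transform(...)` with as many components as categories.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_components(labels, predicted, cats):
    """Map each category to the component index that overlaps with it best."""
    labels = np.asarray(labels)
    predicted = np.asarray(predicted)
    n = len(cats)
    # counts[i, j] = number of samples of category cats[i] assigned to component j
    counts = np.zeros((n, n), dtype=int)
    for i, c in enumerate(cats):
        counts[i] = np.bincount(predicted[labels == c], minlength=n)
    # One-to-one assignment that maximizes the total number of agreeing samples
    rows, cols = linear_sum_assignment(counts, maximize=True)
    return {cats[r]: int(c) for r, c in zip(rows, cols)}


# Toy data (hypothetical): true labels and the component index chosen per sample
labels = ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C']
predicted = [1, 1, 0, 0, 0, 2, 2, 1]
catmap = match_components(labels, predicted, ['A', 'B', 'C'])

# Accuracy under the recovered mapping
accuracy = np.mean([catmap[label] == p for label, p in zip(labels, predicted)])
```

The assignment step runs in polynomial time, so it stays practical for many categories, and it uses the full count table rather than a single average per category, which may help when a category's samples are split across components.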