I'm using Non-Negative Matrix Factorization (NMF) in sklearn for unsupervised category prediction (with labels for checking accuracy), but am running into the problem where I don't have a clear map between input categories and transformed categories.
For example, if my categories are "A", "B", and "C" (n_components=3), I don't know what order the transformed categories will be in. I can manually print the data associated with each output feature to determine which input it most closely resembles, but I am looking for an automatic solution.
Is there a convenient method for this, or do I need to perform guess-and-check to see what category order maximizes accuracy (very slow for large numbers of categories)?
I found a solution that is much faster than guess-and-check. For each known category, average the model's predicted numeric category (the argmax of the transformed output) over all samples with that label. The averages give an approximate mapping, which can then be converted to integers by ranking them:

    import numpy as np

    cats = ['A', 'B', 'C', 'D', 'E']  # Example categories
    values = []
    for c in cats:
        # Get samples from one category at a time
        idxs = train['Category'] == c
        # Get the average (numeric) output category for those samples
        values.append(np.mean(np.argmax(model.transform(data['Train'][idxs]), axis=1)))

    # Rounding may be unreliable, so sort pairs of value and category in ascending order
    # and use their indexes instead to guarantee no duplicates
    pairs = sorted(zip(values, cats))
    # The category mapping is {category: index}
    catmap = {cat: i for i, (_, cat) in enumerate(pairs)}

Example output:

    {'B': 0, 'A': 1, 'C': 2, 'E': 3, 'D': 4}
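
For anyone who wants to try the idea without a fitted model, here is a minimal sketch of the same rank-by-average step on toy data. The lists `labels` and `predicted` are made-up stand-ins for `train['Category']` and the argmax of `model.transform(...)`, and `build_catmap` is a hypothetical helper name, not part of sklearn:

```python
from collections import defaultdict
from statistics import mean


def build_catmap(labels, predicted):
    """Rank categories by the mean predicted component index of their samples."""
    by_cat = defaultdict(list)
    for label, comp in zip(labels, predicted):
        by_cat[label].append(comp)
    # Sort (mean, category) pairs; each category's rank becomes its index,
    # which guarantees that no two categories share an index.
    pairs = sorted((mean(comps), cat) for cat, comps in by_cat.items())
    return {cat: i for i, (_, cat) in enumerate(pairs)}


# Toy data (assumed): true labels and the component index predicted per sample.
labels = ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']
predicted = [1, 1, 2, 0, 0, 0, 2, 2, 1]

catmap = build_catmap(labels, predicted)

# Accuracy after translating each true label to its mapped component index.
hits = sum(catmap[label] == comp for label, comp in zip(labels, predicted))
accuracy = hits / len(labels)
```

On this toy data `catmap` comes out as {'B': 0, 'A': 1, 'C': 2} and 7 of the 9 samples match. Ranking by the mean is a heuristic that relies on each category's samples landing mostly on one component, so it is worth checking the resulting accuracy before trusting the mapping.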