I have the following list on which I would like to perform unsupervised learning, and then use that knowledge to predict a value for each item in a test list
#Format [real_runtime, processors, requested_time, score, more_to_be_added]
#some entries from the list
Xsrc = [['354', '2048', '3600', '53.0521472395'],
['605', '2048', '600', '54.8768871369'],
['128', '2048', '600', '51.0'],
['136', '2048', '900', '51.0000000563'],
['19218', '480', '21600', '51.0'],
['15884', '2048', '18000', '51.0'],
['118', '2048', '1500', '51.0'],
['103', '2048', '2100', '51.0000002839'],
['18542', '480', '21600', '51.0000000001'],
['13272', '2048', '18000', '51.0000000001']]
Using the clusters, I would like to predict the real_runtime of a new list: Xtest = [['-1', '2048', '1500', '51.0000000161'], ['-1', '2048', '10800', '51.0000000002'], ['-1', '512', '21600', '-1'], ['-1', '512', '2700', '51.0000000004'], ['-1', '1024', '21600', '51.1042617556']]
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
##Training dataset
Xsrc = [['354', '2048', '3600', '53.0521472395'],
['605', '2048', '600', '54.8768871369'],
['128', '2048', '600', '51.0'],
['136', '2048', '900', '51.0000000563'],
['19218', '480', '21600', '51.0'],
['15884', '2048', '18000', '51.0'],
['118', '2048', '1500', '51.0'],
['103', '2048', '2100', '51.0000002839'],
['18542', '480', '21600', '51.0000000001'],
['13272', '2048', '18000', '51.0000000001']]
print "Xsrc:", Xsrc
##Test data set
Xtest= [['1224', '2048', '1500', '51.0000000161'],
['7867', '2048', '10800', '51.0000000002'],
['21594', '512', '21600', '-1'],
['1760', '512', '2700', '51.0000000004'],
['115', '1024', '21600', '51.1042617556']]
##Clustering
X = StandardScaler().fit_transform(np.array(Xsrc, dtype=float))  # cast strings to floats before scaling
db = DBSCAN(min_samples=2).fit(X)  # default eps=0.5; no clustering parameters tuned
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
clusters = [X[labels == i] for i in range(n_clusters_)]
print('Estimated number of clusters: %d' % n_clusters_)
# silhouette_score requires at least two distinct cluster labels
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, labels))
##Plotting the dataset
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'
    class_member_mask = (labels == k)
    # Core samples: large markers.
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=20)
    # Non-core samples: small markers.
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=10)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
Any ideas how I can use the clusters to predict the value?
There is little use in "predicting" a cluster label, because it is assigned essentially arbitrarily by the clustering algorithm.
Even worse: most clustering algorithms cannot incorporate new data at all.
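If you insist on pushing new points into DBSCAN's clusters anyway, one common workaround is to label each new point by its nearest core sample. This is only a sketch: sklearn's DBSCAN has no predict method, and the helper name dbscan_assign and the eps default here are my own inventions:

from sklearn.neighbors import NearestNeighbors

def dbscan_assign(db, X_train, X_new, eps=0.5):
    # Hypothetical helper: give each new (already-scaled) point the label
    # of its nearest DBSCAN core sample, or -1 (noise) if that core
    # sample is farther away than eps.
    cores = X_train[db.core_sample_indices_]
    core_labels = db.labels_[db.core_sample_indices_]
    dist, idx = NearestNeighbors(n_neighbors=1).fit(cores).kneighbors(X_new)
    new_labels = core_labels[idx.ravel()]
    new_labels[dist.ravel() > eps] = -1
    return new_labels

Note that the new points must be transformed with the same StandardScaler that was fit on the training data, and eps should match the value used for clustering.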
You really should use clustering to explore your data, and learn what is there and what is not. Do not rely on the clustering being "good".
Sometimes, people have success with quantizing the data set into k centers, and then using only this "compressed" data set for classification/prediction (usually based on the nearest neighbor only). I have also seen the idea of training one regression per cluster for prediction, and choosing the regressor to apply via nearest neighbors (i.e. if the data fits a cluster well, use that cluster's regression model). Rough sketches of both ideas follow below. But I don't remember any major success stories...
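A minimal sketch of the quantization idea, continuing from the question's Xsrc and Xtest. KMeans, n_clusters=3, and predicting real_runtime as the mean runtime of the nearest center are all my assumptions, not a recommendation:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

data = np.array(Xsrc, dtype=float)
runtime, features = data[:, 0], data[:, 1:]   # real_runtime vs. the known columns

scaler = StandardScaler().fit(features)
km = KMeans(n_clusters=3, n_init=10, random_state=0)  # k=3 is arbitrary here
km.fit(scaler.transform(features))

# The "compressed" data set: one mean runtime per center.
center_runtime = np.array([runtime[km.labels_ == i].mean()
                           for i in range(km.n_clusters)])

test_features = np.array(Xtest, dtype=float)[:, 1:]
print(center_runtime[km.predict(scaler.transform(test_features))])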
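And a sketch of the per-cluster regression idea, reusing km, scaler, features, runtime, and test_features from the block above; LinearRegression is just a placeholder model:

from sklearn.linear_model import LinearRegression

# One regressor per cluster: known columns -> real_runtime.
models = [LinearRegression().fit(features[km.labels_ == i],
                                 runtime[km.labels_ == i])
          for i in range(km.n_clusters)]

# Route each test point to the regressor of its nearest center.
assigned = km.predict(scaler.transform(test_features))
preds = [models[c].predict(row.reshape(1, -1))[0]
         for c, row in zip(assigned, test_features)]
print(preds)

With only ten training rows, some clusters may hold just one or two points, so these per-cluster fits are severely underdetermined; the sketch only illustrates the mechanics.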