Tags: python, k-means, unsupervised-learning, tsne

Unsupervised learning using t-SNE and K-means


I am trying to do unsupervised learning on a dataset for feature extraction, to find out which groups of data cluster together and what the main features (the centroid) of each group are. My plan is to use K-means to find the weight of each centroid, but before applying K-means I use t-SNE to reduce the dimensionality of the data so it can be shown in a scatter plot. My goal is to find the centroid with the most bad-condition data points and the fewest good-condition data points. Here is a sample of my code:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

# Set a seed for reproducibility
np.random.seed(42)

# Generate dummy data with random values
num_rows = 1000

# Create a DataFrame with random values and specific column names
dummy_data = pd.DataFrame({
    'Name': [np.random.choice(['Alice', 'Bob', 'Charlie', 'David', 'Eva',
                               'Frank', 'Grace', 'Henry', 'Isabella', 'Jack',
                               'Kate', 'Liam', 'Mia', 'Noah', 'Olivia',
                               'Peter', 'Quinn', 'Rachel', 'Sam', 'Taylor'])
             for _ in range(num_rows)],
    'Condition': np.random.choice(['Good', 'Bad'], size=num_rows),
    'Latency_Wifi': np.random.normal(loc=1, scale=0.2, size=num_rows),    # note: same distribution for all rows, independent of 'Condition'
    'Loss_Wifi': np.random.normal(loc=0.05, scale=0.02, size=num_rows),
    'Latency_Gaming': np.random.normal(loc=1, scale=0.2, size=num_rows),
    'Loss_Gaming': np.random.normal(loc=0.05, scale=0.02, size=num_rows),
    'Latency_Video': np.random.normal(loc=1, scale=0.2, size=num_rows),
    'Loss_Video': np.random.normal(loc=0.05, scale=0.02, size=num_rows),
    'Latency_WFH': np.random.normal(loc=1, scale=0.2, size=num_rows),
    'Loss_WFH': np.random.normal(loc=0.05, scale=0.02, size=num_rows),
})

features = dummy_data.drop(['Name', 'Condition'], axis=1)

# Standardize the data to have zero mean and unit variance
scaler = StandardScaler()
data_scaled = scaler.fit_transform(features)

# kpca = KernelPCA(n_components=10, kernel='rbf', gamma=0.1)
# data_kpca = kpca.fit_transform(data_scaled)

# Apply t-SNE for further dimensionality reduction
tsne = TSNE(n_components=2, random_state=42)
data_tsne = tsne.fit_transform(data_scaled)

df = dummy_data

features = dummy_data.drop(['Name', 'Condition'], axis=1)
columns_of_interest = features.columns.to_list()

# Apply K-means on the t-SNE components
n_clusters = 10
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
labels = kmeans.fit_predict(data_tsne)

# Add t-SNE components and cluster labels to the original DataFrame
df['TSNE_Component_1'] = data_tsne[:, 0]
df['TSNE_Component_2'] = data_tsne[:, 1]
df['Cluster'] = labels

# Get the centroid coordinates
centroids = pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_), columns=columns_of_interest)

# Display the main features for each centroid
for cluster_num in range(n_clusters):
    centroid_features = centroids.iloc[cluster_num]
    main_features = centroid_features.abs().sort_values(ascending=False).head(3)  # Display top 3 features
    print(f"Cluster {cluster_num + 1}: Main Features - {main_features.index.tolist()}")

# Count the number of users in each cluster
cluster_counts = df['Cluster'].value_counts().reset_index()
cluster_counts.columns = ['Cluster', 'Number_of_Users']

# Select the top 10 clusters based on the highest number of users
top_clusters = cluster_counts.nlargest(10, 'Number_of_Users')['Cluster'].tolist()

# Filter the DataFrame for the top clusters
df_top_clusters = df[df['Cluster'].isin(top_clusters)]

But I ran into an error when running the code above:

centroids = pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_), columns=columns_of_interest)

ValueError: operands could not be broadcast together with shapes (10,2) (8,) (10,2)

A friend suggested that I use another function to reduce the dimensionality of my data from non-linear to linear, but I thought that was the whole purpose of using t-SNE?


Solution

  • You fit the scaler on the features, which have 8 columns, and then tried to inverse-transform the t-SNE data, which has only 2 columns.
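    The mismatch can be reproduced in isolation (a minimal sketch with synthetic data, not the question's dataset): a StandardScaler fitted on 8 columns stores 8 per-feature means and scales, so it cannot inverse-transform a 2-column array.

    ```python
    import numpy as np
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 8))          # 8 features, like the question's data

    scaler = StandardScaler().fit(X)       # scaler now stores 8 means / 8 scales

    centers_2d = rng.normal(size=(10, 2))  # like K-means centers in t-SNE space
    try:
        scaler.inverse_transform(centers_2d)
        error = None
    except ValueError as exc:              # 2 columns cannot be matched with 8
        error = exc
    print(type(error).__name__)            # ValueError
    ```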

    To quickly solve this, fit the scaler on the t-SNE data instead (note that columns_of_interest then also needs to be replaced with two t-SNE column names, since the inverse-transformed centroids will have 2 columns, not 8):

    tsne = TSNE(n_components=2, random_state=42)
    data_tsne = tsne.fit_transform(data_scaled)
    data_scaled = scaler.fit_transform(data_tsne)
    

    I got the following when doing so:

    Results
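Beyond the quick fix, note that t-SNE has no inverse mapping back to the original feature space, so no scaler can recover real feature values from t-SNE-space centroids. A common alternative (a sketch with stand-in column names, not code from the original post) is to describe each cluster by the per-cluster mean of the original, unscaled features:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Stand-in for the 8 telemetry columns in the question
features = pd.DataFrame(
    rng.normal(loc=1.0, scale=0.2, size=(1000, 8)),
    columns=[f"metric_{i}" for i in range(8)],
)

data_scaled = StandardScaler().fit_transform(features)
labels = KMeans(n_clusters=10, n_init=10, random_state=42).fit_predict(data_scaled)

# Interpretable "centroids": average of the original features per cluster,
# already in the original units -- no inverse_transform needed
centroids = features.groupby(labels).mean()
print(centroids.shape)
```

This keeps t-SNE purely for the 2-D scatter plot while the cluster descriptions come straight from the original 8 features.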