I have been trying to cluster infrared spectroscopy data with the sklearn clustering methods, but I am having trouble getting the clustering to work. Since I'm new to this, I don't know whether my code is wrong or my approach is wrong.
My data, in Pandas DataFrame format, looks like this:
Index Wavenumbers (cm-1) %Transmission_i ...
0 650 100 ...
. . . ...
. . . ...
. . . ...
n 4000 95 ...
where the x-axis for all spectra is the Wavenumbers (cm-1) column and the subsequent columns (%Transmission_i) are the actual data. I want to cluster these columns (in terms of which spectra are most similar to each other), so I am trying this code:
X = np.array([list(df[x].values) for x in df.set_index(x)])
clusters = DBSCAN().fit(X)
where df is my DataFrame and np is numpy (hopefully obvious). The problem is that when I print the cluster labels, I get nothing but -1, which means all my data is treated as noise. This isn't the case: when I plot my data I can clearly see that some spectra look very similar (as they should).
How can I get the similar spectra to be clustered properly?
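One way to check whether the all-noise result is a scale issue rather than a coding error is to compare the pairwise distances between spectra with DBSCAN's defaults (eps=0.5, min_samples=5). The sketch below is only an illustration: the `spectra` array and its values are made up (not my real data), with one row per spectrum, and the tuned `eps=25` is chosen by hand for this synthetic example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

# Hypothetical stand-in data: 6 spectra (rows) sampled at 500 wavenumbers,
# in %Transmission units. Two groups of three similar spectra each.
rng = np.random.default_rng(0)
grid = np.linspace(0, 6, 500)
base_a = 90 + 5 * np.sin(grid)
base_b = 60 + 5 * np.cos(grid)
spectra = np.vstack(
    [base_a + rng.normal(0, 0.5, grid.size) for _ in range(3)]
    + [base_b + rng.normal(0, 0.5, grid.size) for _ in range(3)]
)

# Euclidean distance between every pair of spectra (diagonal excluded).
dist = pairwise_distances(spectra)
off_diag = dist[~np.eye(len(spectra), dtype=bool)]
print("smallest pairwise distance:", off_diag.min())

# With the defaults, every distance exceeds eps, so every spectrum is noise.
default_labels = DBSCAN().fit(spectra).labels_
print("default labels:", default_labels)

# With eps on the scale of the data and a smaller min_samples,
# the two groups are separated.
tuned_labels = DBSCAN(eps=25, min_samples=2).fit(spectra).labels_
print("tuned labels:", tuned_labels)
```

If the smallest pairwise distance is already far above 0.5, the all -1 result follows from the default parameters and the data scale, not from how the array was built.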
EDIT: Here is a minimum working example.
import numpy as np
import pandas as pd
import sklearn as sk
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
x = 'x-vals'
def cluster_data(df):
    avg_list = []
    dif_list = []
    for col in df:
        if x == col:
            continue
        avg_list.append(np.mean(df[col].values))
        dif_list.append(np.mean(np.diff(df[col].values)))
    a = sk.preprocessing.normalize([avg_list], norm='max')[0]
    b = sk.preprocessing.normalize([dif_list], norm='max')[0]
    X = []
    for i,j in zip(a,b):
        X.append([i,j])
    X = np.array(X)
    clusters = DBSCAN(eps=0.2).fit(X)
    return clusters.labels_
def plot_clusters(df, clusters):
    colors = ['red', 'green', 'blue', 'black', 'pink']
    i = 0
    for col in df:
        if col == x:
            continue
        color = colors[clusters[i]]
        plt.plot(df[x], df[col], color=color)
        i +=1
    plt.show()
x1 = np.linspace(-np.pi, np.pi, 201)
y1 = np.sin(x1) + 1
y2 = np.cos(x1) + 1
y3 = np.zeros_like(x1) + 2
y4 = np.zeros_like(x1) + 1.9
y5 = np.zeros_like(x1) + 1.8
y6 = np.zeros_like(x1) + 1.7
y7 = np.zeros_like(x1) + 1
y8 = np.zeros_like(x1) + 0.9
y9 = np.zeros_like(x1) + 0.8
y10 = np.zeros_like(x1) + 0.7
df = pd.DataFrame({'x-vals':x1, 'y1':y1, 'y2':y2, 'y3':y3, 'y4':y4,
                   'y5':y5, 'y6':y6, 'y7':y7, 'y8':y8, 'y9':y9,
                   'y10':y10})
clusters = cluster_data(df)
plot_clusters(df, clusters)
This produces the following plot, where red is a cluster and pink is noise.
I was able to get a method working, but I'm not fully convinced this is the best method for clustering IR spectra.
First I run through all the spectra and compile a list of the mean and the mean of the first derivative of each spectrum. The mean is supposed to represent the vertical location of the spectrum, while the mean of the first derivative is supposed to represent its shape.
avg_list = []
dif_list = []
for col in df:
    if x == col:
        continue
    avg_list.append(np.mean(df[col].values))
    dif_list.append(np.mean(np.diff(df[col].values)))
Then I normalize each list so that I can pick an eps value based on percent changes.
a = sk.preprocessing.normalize([avg_list], norm='max')[0]
b = sk.preprocessing.normalize([dif_list], norm='max')[0]
After that I make a 2D array for running DBSCAN in 2D mode.
X = []
for i,j in zip(a,b):
    X.append([i,j])
Then I run the DBSCAN clustering method with an arbitrary percent difference value for the eps parameter.
X = np.array(X)
clusters = DBSCAN(eps=0.2).fit(X)
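Rather than picking eps arbitrarily, one common heuristic is to look at each point's distance to its k-th nearest neighbour and choose eps just below the point where the sorted distances jump. The sketch below assumes `X` is the (n_spectra, 2) feature array built above; here it is replaced by made-up values so the block runs on its own. The value of `k` and the hand-picked `eps = 0.15` are assumptions for this toy data only.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Hypothetical feature array: one row per spectrum, columns are the
# normalized mean and the normalized mean first derivative.
X = np.array([
    [1.00, 0.0], [0.95, 0.0], [0.90, 0.0], [0.85, 0.0],
    [0.50, 0.0], [0.45, 0.0], [0.40, 0.0], [0.35, 0.0],
    [0.50, 1.0],
])

k = 3  # assumed to match DBSCAN's min_samples, which counts the point itself

# Querying with the training data returns each point as its own first
# neighbour (distance 0), so the last column is the distance to the
# (k-1)-th nearest *other* point.
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])
print("sorted k-distances:", np.round(k_dist, 3))

# Choose eps above the flat part of the curve and below the jump.
eps = 0.15
labels = DBSCAN(eps=eps, min_samples=k).fit(X).labels_
print("labels:", labels)
```

Plotting `k_dist` instead of printing it makes the jump easier to spot when there are many spectra.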
Then clusters.labels_ returns an array whose length equals the number of spectra in my DataFrame. It works fairly well, but it is rather exclusive and the clusters could be better. Some more fine tuning would be helpful.
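For the fine tuning, one option is to sweep eps over a range and watch how the number of clusters and noise points changes, then pick a value from a range where the result is stable. This is a minimal sketch, assuming `X` is the normalized 2D feature array from above (replaced here by hypothetical values so it runs on its own). The `summarize` helper is my own name, not part of sklearn, and `min_samples=3` is an assumption.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical stand-in for the normalized (mean, mean-derivative) features.
X = np.array([
    [1.00, 0.0], [0.95, 0.0], [0.90, 0.0], [0.85, 0.0],
    [0.50, 0.0], [0.45, 0.0], [0.40, 0.0], [0.35, 0.0],
    [0.50, 1.0],
])

def summarize(X, eps, min_samples=3):
    """Return (number of clusters, number of noise points) for one eps."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X).labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise

# Too small an eps marks everything as noise; too large merges everything.
for eps in (0.02, 0.07, 0.2, 0.5, 1.2):
    n_clusters, n_noise = summarize(X, eps)
    print(f"eps={eps:<5} clusters={n_clusters} noise={n_noise}")
```

On real spectra the stable range may be narrow, in which case adding more descriptive features than the two used here is probably worth trying before tuning eps further.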