python-3.x scikit-learn cluster-analysis dbscan

How to Cluster Infrared Spectroscopy Data with Python


I have been trying to cluster infrared spectroscopy data with the sklearn clustering methods, but I can't get the clustering to work. Since I'm new to this, I don't know whether my code is wrong or my whole approach is wrong.

My data, in Pandas DataFrame format, looks like this:

Index     Wavenumbers (cm-1)     %Transmission_i   ...
0         650                    100               ... 
.          .                      .                ...
.          .                      .                ...
.          .                      .                ...
n         4000                   95                ...

Here the x-axis for all spectra is the Wavenumbers (cm-1) column, and the subsequent columns (%Transmission_i) are the actual data. I want to cluster these columns by which spectra are most similar to each other, so I am trying this code:

X        = np.array([list(df[x].values) for x in df.set_index(x)])
clusters = DBSCAN().fit(X)

where df is my DataFrame and np is numpy. The problem is that when I print the cluster labels, every one of them is -1, which means all my data is treated as noise. That isn't right: when I plot my data I can clearly see that some spectra look very similar (as they should).

How can I get the similar spectra to be clustered properly?

EDIT: Here is a minimum working example.

import numpy as np
import pandas as pd
import sklearn as sk
import sklearn.preprocessing  # load the submodule explicitly so sk.preprocessing is available
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

x = 'x-vals'

def cluster_data(df):

    avg_list = []
    dif_list = []
    for col in df:
        if x == col:
            continue
        avg_list.append(np.mean(df[col].values))
        dif_list.append(np.mean(np.diff(df[col].values)))

    a = sk.preprocessing.normalize([avg_list], norm='max')[0]
    b = sk.preprocessing.normalize([dif_list], norm='max')[0]

    X = []
    for i,j in zip(a,b):
        X.append([i,j])

    X = np.array(X)
    clusters = DBSCAN(eps=0.2).fit(X)

    return clusters.labels_

def plot_clusters(df, clusters):
    colors = ['red', 'green', 'blue', 'black', 'pink']
    i      = 0
    for col in df:
        if col == x:
            continue
        color = colors[clusters[i]]
        plt.plot(df[x], df[col], color=color)
        i +=1
    plt.show()


x1  = np.linspace(-np.pi, np.pi, 201)
y1  = np.sin(x1) + 1
y2  = np.cos(x1) + 1
y3  = np.zeros_like(x1) + 2
y4  = np.zeros_like(x1) + 1.9
y5  = np.zeros_like(x1) + 1.8
y6  = np.zeros_like(x1) + 1.7
y7  = np.zeros_like(x1) + 1
y8  = np.zeros_like(x1) + 0.9
y9  = np.zeros_like(x1) + 0.8
y10 = np.zeros_like(x1) + 0.7

df  = pd.DataFrame({'x-vals':x1, 'y1':y1, 'y2':y2, 'y3':y3, 'y4':y4,
                    'y5':y5, 'y6':y6, 'y7':y7, 'y8':y8, 'y9':y9,
                    'y10':y10})

clusters = cluster_data(df)

plot_clusters(df, clusters)

This produces the following plot, where red is a cluster and pink is noise.

[Figure: plot made by the minimum working example]


Solution

  • I was able to get a method working, but I'm not fully convinced it is the best way to cluster IR spectra.

    First I loop over all the spectra and compile two lists: the mean of each spectrum and the mean of its first derivative. The mean is supposed to represent the vertical position of the spectrum, while the mean of the first derivative is supposed to represent its shape.

    avg_list = []
    dif_list = []
    for col in df:
        if x == col:
            continue
        avg_list.append(np.mean(df[col].values))
        dif_list.append(np.mean(np.diff(df[col].values)))
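
One caveat worth checking on this step: the mean of `np.diff(y)` telescopes to `(y[-1] - y[0]) / (len(y) - 1)`, so it depends only on the two endpoints of a spectrum and not on any peaks in between. The sketch below illustrates this on two made-up curves (hypothetical arrays, not IR data). The helper names are invented for illustration, and the mean absolute difference is shown only as one possible alternative feature that does respond to shape.

```python
import numpy as np

def mean_first_difference(y):
    """Mean of the first difference, as used in the step above."""
    return float(np.mean(np.diff(y)))

def mean_abs_first_difference(y):
    """Mean absolute first difference (assumed alternative feature)."""
    return float(np.mean(np.abs(np.diff(y))))

# Hypothetical demo curves: a flat line and the same line with a Gaussian bump.
x_demo = np.linspace(0.0, 1.0, 101)
flat = np.full_like(x_demo, 2.0)
peaked = 2.0 + np.exp(-((x_demo - 0.5) ** 2) / 0.005)

# The mean of the differences equals the endpoint slope, whatever happens in between.
endpoint_slope = (peaked[-1] - peaked[0]) / (len(peaked) - 1)

print(mean_first_difference(flat), mean_first_difference(peaked), endpoint_slope)
print(mean_abs_first_difference(flat), mean_abs_first_difference(peaked))
```

If the spectra all start and end near the same baseline, the first feature will be nearly identical for every column, which may be part of why the clusters end up looking coarse.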
    

    Then I normalize each list so that I can pick an eps value based on percent changes.

    a = sk.preprocessing.normalize([avg_list], norm='max')[0]
    b = sk.preprocessing.normalize([dif_list], norm='max')[0]
    

    After that I make a 2D array for running DBSCAN in 2D mode.

    X = []
    for i,j in zip(a,b):
        X.append([i,j])
    

    Then I run the DBSCAN clustering method with an arbitrary percent difference value for the eps parameter.

    X        = np.array(X)
    clusters = DBSCAN(eps=0.2).fit(X)
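
Rather than guessing `eps`, a common heuristic is to sort the distance from every point to its k-th nearest neighbor and pick a value near the "knee" of that curve. The following is a minimal sketch, assuming `X` is a 2D feature array like the one built above. The helper name `k_distances` and the toy points are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distances(X, k=1):
    """Sorted distance from each row of X to its k-th nearest other row."""
    # Ask for k + 1 neighbors because each point is returned as its own
    # nearest neighbor (distance 0) when the training data is queried.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, k])

# Hypothetical toy points: two tight groups and one far-away outlier.
X_demo = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
                   [1.0, 1.0], [1.0, 1.1], [5.0, 5.0]])

kd = k_distances(X_demo, k=1)
print(kd)  # small values for grouped points, one large value for the outlier
```

A value of `eps` just above the small distances and well below the jump would keep the tight groups together and leave the outlier as noise. Since scikit-learn counts the point itself in `min_samples`, `k = min_samples - 1` is a reasonable starting point.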
    

    clusters.labels_ is then an array with one label per spectrum in my DataFrame. This works fairly well, but it is rather exclusive and the clusters could be better. Some more fine-tuning would help.
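
An alternative that avoids summarizing each spectrum by two numbers is to treat each spectrum as one sample, with every wavenumber as a feature, and let DBSCAN compare whole curves. This is a sketch of a different technique, not the method above. The function name `cluster_spectra`, the `eps` value, and `min_samples=2` are assumptions that would need tuning on real IR data. Dividing by the square root of the number of points turns the Euclidean distance into an RMS difference, so `eps` is expressed in the same units as the spectra.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

def cluster_spectra(df, x_col, eps=0.15, min_samples=2):
    """Cluster the data columns of df, using each whole column as one sample."""
    spectra = df.drop(columns=[x_col])
    X = spectra.to_numpy().T            # shape: (n_spectra, n_points)
    X = X / np.sqrt(X.shape[1])         # Euclidean distance becomes RMS difference
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X).labels_
    return dict(zip(spectra.columns, labels))

# Same synthetic curves as in the question's example.
x1 = np.linspace(-np.pi, np.pi, 201)
data = {'x-vals': x1, 'y1': np.sin(x1) + 1, 'y2': np.cos(x1) + 1}
offsets = [2.0, 1.9, 1.8, 1.7, 1.0, 0.9, 0.8, 0.7]
for n, offset in enumerate(offsets, start=3):
    data[f'y{n}'] = np.zeros_like(x1) + offset
df_demo = pd.DataFrame(data)

labels = cluster_spectra(df_demo, 'x-vals', eps=0.15)
print(labels)
```

On this synthetic data the two groups of flat lines should fall into separate clusters and the sine and cosine curves should be labeled as noise. Real spectra will likely need baseline correction or a different `eps` before the result is useful.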