Tags: python, pandas, k-means

How to input 3D data from dataframe for k-means clustering?


I have 505 sets of patient data (rows), each containing 17 sets (cols) of 3D [x,y,z] arrays.

In : data.iloc[0][0]
Out: array([ -23.47808471,   -9.92158009, 1447.74107884])

[Image: snippet of df for clarity]

Each set of patient data is a collection of 3D points marking the centers of vertebrae, with 17 vertebrae marked per patient. I am attempting to use k-means clustering to classify how many different types of spines there are in the dataset. However, when I try to fit the model, I get errors such as "ValueError: setting an array element with a sequence." I am not sure how to reshape my dataframe so that each patient's data is treated as one separate sample.

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4, n_init=10, max_iter=300)
kmeans.fit(data)

Thank you!

[Image: plot of one row of data]


Solution

  • kmeans.fit expects a 2-D array of shape (n_samples, n_features), whereas each cell of your dataframe holds a 3-element array, so your data is effectively 3-D. One option is to unpack each point into separate x, y, and z feature columns. For example:

    # Repeat these three lines for every vertebra column
    data['Spine_L1_Center_x'] = data['Spine_L1_Center'].apply(lambda x: x[0])
    data['Spine_L1_Center_y'] = data['Spine_L1_Center'].apply(lambda x: x[1])
    data['Spine_L1_Center_z'] = data['Spine_L1_Center'].apply(lambda x: x[2])

    # Drop the original array-valued columns once they have been unpacked
    data.drop(columns=['Spine_L1_Center', ... ], inplace=True)


    Then fit the model on the new, purely numeric dataframe.
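
The same flattening can also be done for all columns at once with NumPy. The following is a minimal sketch, assuming every cell holds a length-3 NumPy array. The dataframe here is synthetic stand-in data with hypothetical column names (vertebra_0 to vertebra_16) and only 20 rows, not the 505 patients from the question, and random_state is added purely for reproducibility.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the real data (assumption): 20 patients x 17 vertebrae,
# where every cell is a length-3 array [x, y, z]. Column names are hypothetical.
rng = np.random.default_rng(0)
n_patients, n_vertebrae = 20, 17
data = pd.DataFrame({
    f"vertebra_{i}": [rng.normal(size=3) for _ in range(n_patients)]
    for i in range(n_vertebrae)
})

# Convert the object-dtype cells into one numeric array of shape (patients, 17, 3).
points = np.array(data.to_numpy().tolist(), dtype=float)

# Flatten each patient's 17 points into a single row of 17 * 3 = 51 features,
# ordered x, y, z for the first vertebra, then x, y, z for the second, and so on.
X = points.reshape(len(points), -1)

# Fit k-means on the 2-D feature matrix, one row per patient.
kmeans = KMeans(n_clusters=4, n_init=10, max_iter=300, random_state=0)
kmeans.fit(X)
labels = kmeans.labels_  # one cluster label per patient
```

With the real data, X would have shape (505, 51), and each patient remains a single sample, which keeps the patients separate from one another during clustering.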