python machine-learning data-science k-means

Problem using an array in K-means


I need some help, please. I'm running the code below to one-hot encode each column with OneHotEncoder, then I write the encoded column back into my dataset and run K-means. To fit the encoded result into the column I use tolist(). When I run K-means I get the following error: ValueError: setting an array element with a sequence. I searched a bit about it, but I didn't find a definitive solution.

I'm using 45 columns. For now I'm putting them in a DataFrame, but if there were a way to put each column in an array, that would be even better.

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse=True)
SP_results_one_hot0 = encoder.fit_transform(SP_results_Array[:,0].reshape(-1,1))
SP_results_one_hot1 = encoder.fit_transform(SP_results_Array[:,1].reshape(-1,1))
SP_results_one_hot2 = encoder.fit_transform(SP_results_Array[:,2].reshape(-1,1))
SP_results_one_hot3 = encoder.fit_transform(SP_results_Array[:,3].reshape(-1,1))
SP_results_one_hot4 = encoder.fit_transform(SP_results_Array[:,4].reshape(-1,1))
SP_results_one_hot5 = encoder.fit_transform(SP_results_Array[:,5].reshape(-1,1))
SP_results_one_hot6 = encoder.fit_transform(SP_results_Array[:,6].reshape(-1,1))
SP_results_one_hot7 = encoder.fit_transform(SP_results_Array[:,7].reshape(-1,1))
SP_results_one_hot8 = encoder.fit_transform(SP_results_Array[:,8].reshape(-1,1))
SP_results_one_hot9 = encoder.fit_transform(SP_results_Array[:,9].reshape(-1,1))



SP_results["Division Vendedor"] = SP_results_one_hot0.toarray().tolist()
SP_results["Tiempo en la Empresa"] = SP_results_one_hot1.toarray().tolist()
SP_results["Id Supervisor"] = SP_results_one_hot2.toarray().tolist()
SP_results["ID Region"] = SP_results_one_hot3.toarray().tolist()
SP_results["cargo"] = SP_results_one_hot4.toarray().tolist()
SP_results["address"] = SP_results_one_hot5.toarray().tolist()
SP_results["Idad"] = SP_results_one_hot6.toarray().tolist()
SP_results["sexo"] = SP_results_one_hot7.toarray().tolist()
SP_results["Nacion"] = SP_results_one_hot8.toarray().tolist()
SP_results["Tipo de vendedor"] = SP_results_one_hot9.toarray().tolist()


features = SP_results

from sklearn.cluster import KMeans

    km = KMeans(n_clusters=i)
    clusters = km.fit(features)


ValueError: setting an array element with a sequence.


Solution

  • Instead of handling each column separately, you can use get_dummies and pass it a list of columns. It will take care of the encoding for you. Here is an example:

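    import pandas as pd
    col_list = ["A", "B", "C"]
    # data is a pandas DataFrame; columns= names the columns to encode
    # (the second positional argument of get_dummies is prefix, not columns)
    data_new = pd.get_dummies(data, columns=col_list)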
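
To make the call above concrete, here is a minimal sketch on a toy DataFrame. The data and the "ventas" column are made up for illustration; only "sexo" and "cargo" echo column names from the question. It assumes the categorical columns are passed through the columns= keyword, and it asks for float output with dtype=float so the result is plain numeric.

```python
import pandas as pd

# Hypothetical toy data; not the asker's real dataset.
data = pd.DataFrame({
    "sexo": ["M", "F", "F", "M"],
    "cargo": ["junior", "senior", "junior", "senior"],
    "ventas": [10.0, 25.0, 12.0, 30.0],
})

# Encode only the listed categorical columns; other columns pass through.
col_list = ["sexo", "cargo"]
data_new = pd.get_dummies(data, columns=col_list, dtype=float)

# One flat numeric column per category, instead of one list per cell.
print(data_new.columns.tolist())
print(data_new.dtypes)
```

Each category becomes its own numeric column, so no cell ever holds a list.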
    

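    KMeans needs its input as a numeric 2-D array. Storing the one-hot vectors with tolist() puts a list in every cell, and such a column cannot be converted to a numeric array, which is what raises the ValueError. With the dummy columns you can do something like this: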

    from sklearn.cluster import KMeans

    # i is the number of clusters, as in the question's code
    km = KMeans(n_clusters=i)
    # data_new.values converts the DataFrame to a NumPy array
    clusters = km.fit(data_new.values)
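
As a rough end-to-end illustration, the sketch below first shows why a column of lists cannot become a numeric array, then fits KMeans on the get_dummies output. Everything in it is an assumption made for the example: the toy data, n_clusters=2, n_init=10 and random_state=0. It uses to_numpy(), which should give the same array as .values here.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# What the question does: every cell holds a whole one-hot vector (a list).
bad = pd.DataFrame({"sexo": [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]})
try:
    np.asarray(bad.to_numpy(), dtype=float)
    conversion_failed = False
except (ValueError, TypeError):
    # The question reports ValueError: setting an array element with a sequence.
    conversion_failed = True

# Hypothetical toy data standing in for SP_results.
data = pd.DataFrame({
    "sexo": ["M", "F", "F", "M", "F", "M"],
    "cargo": ["junior", "senior", "junior", "senior", "senior", "junior"],
})

# One flat numeric column per category.
data_new = pd.get_dummies(data, columns=["sexo", "cargo"], dtype=float)
X = data_new.to_numpy()  # 2-D float array: rows are samples, columns are dummies

# Cluster count and seed are arbitrary choices for this sketch.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

print(conversion_failed, X.shape, labels)
```

With 45 columns the same pattern applies: list the categorical ones in columns= and pass the resulting array to KMeans.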
    

    Hope this helps.

    References:

    1. pandas.get_dummies

    2. KMeans