Tags: python, pandas, dataframe, cluster-analysis

Clustering values in a dataframe in python


I have a dataframe with 76 columns. The first column contains date values and the other 75 columns are groundwater levels from 75 different boreholes. I want to cluster the boreholes based on their trend, so that boreholes that follow the same pattern are grouped together. How can I do this in Python?

Here is a sample of my dataframe

import pandas as pd

df = pd.DataFrame({
    'Date': [1980, 1985, 1990, 1995, 2000],
    'borehole1': [53, 43, 33, 22, 18],
    'borehole2': [50, 40, 30, 50, 40],
    'borehole3': [22, 32, 42, 32, 13],
    'borehole4': [60, 65, 82, 72, 60],
    'borehole5': [80, 70, 60, 80, 70],
    'borehole6': [43, 33, 22, 18, 13]
})

# Use Date as the x-axis so it is not drawn as a seventh series
df.plot(x='Date')

In this example I would therefore have 3 clusters:

  • borehole1 & borehole6 >> cluster 1
  • borehole2 & borehole5 >> cluster 2
  • borehole3 & borehole4 >> cluster 3

Solution

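  • K-Means is a reasonable choice for this. Here is a sample (below) that runs it on the iris dataset with K set to 3. To use it on your own data, pass your feature matrix to fit() in place of the iris data. K-Means is unsupervised, so no y is needed. scikit-learn treats each row as one sample, so for the borehole dataframe you would first move Date to the index and transpose, giving one row per borehole.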

    # K-MEANS CLUSTERING
    # Importing modules
    from sklearn import datasets
    from sklearn.cluster import KMeans
    import matplotlib.pyplot as plt

    # Loading dataset
    iris = datasets.load_iris()

    # Declaring model (fixed seed so the labels are reproducible)
    model = KMeans(n_clusters=3, n_init=10, random_state=0)

    # Fitting model
    model.fit(iris.data)

    # Predicting a single input (one row with all four iris features)
    predicted_label = model.predict([[7.2, 3.5, 0.8, 1.6]])

    # Prediction on the entire data
    all_predictions = model.predict(iris.data)

    # Printing predictions
    print(predicted_label)
    print(all_predictions)

    # Plot the first three features, coloured by predicted cluster
    X = iris.data[:, :3]

    fig = plt.figure(figsize=(10, 10))
    # Name the axes object ax so that plt still refers to pyplot
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(X[:, 0], X[:, 1], X[:, 2],
               c=all_predictions, edgecolor='red', s=40, alpha=0.5)
    ax.set_title("K-Means clusters on the first three iris features")
    ax.set_xlabel(iris.feature_names[0])
    ax.set_ylabel(iris.feature_names[1])
    ax.set_zlabel(iris.feature_names[2])
    plt.show()
    


    See this link for more info.

    https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html
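
Applied to the borehole sample from the question, the same idea might look like the sketch below. It rests on a few assumptions that are not in the answer above: each borehole is turned into one row by transposing the frame, each series is z-scored so that boreholes are compared by the shape of their trend and not by their absolute water level, and the n_init and random_state values are arbitrary choices made only for reproducibility. The names series, scaled and labels are illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({
    'Date': [1980, 1985, 1990, 1995, 2000],
    'borehole1': [53, 43, 33, 22, 18],
    'borehole2': [50, 40, 30, 50, 40],
    'borehole3': [22, 32, 42, 32, 13],
    'borehole4': [60, 65, 82, 72, 60],
    'borehole5': [80, 70, 60, 80, 70],
    'borehole6': [43, 33, 22, 18, 13]
})

# One row per borehole, one column per date
series = df.set_index('Date').T

# Assumption of this sketch: z-score each borehole's series so that
# clustering reflects the shape of the trend, not the absolute level
row_mean = series.mean(axis=1)
row_std = series.std(axis=1)
scaled = series.sub(row_mean, axis=0).div(row_std, axis=0)

# K = 3 as in the question; seed and n_init are arbitrary choices
model = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = model.fit_predict(scaled.to_numpy())

# Map each borehole name to its cluster id
labels = pd.Series(cluster_ids, index=series.index, name='cluster')
print(labels.sort_values())
```

The cluster ids themselves are arbitrary; what matters is which boreholes share an id. With 75 real boreholes the right K is usually not known in advance, so it is worth trying several values and inspecting the resulting groups.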