python scikit-learn cluster-analysis k-means random-forest

Which algorithm to use for a clustering problem when we have numerical and categorical data?


I'm fairly new to clustering and a bit confused about which method to use. I have a set of buildings that I want to cluster according to their energy consumption, size, type, and neighborhood. I used k-means, and I used the "get_dummies" method to deal with my categorical data.

Is this a correct way to deal with categorical data? (I also tried simply mapping the categories to numbers like 1, 2, 3, etc. and normalizing them before clustering, but didn't get suitable results.) If you suggest another algorithm (random forest, SVM, or anything else), I would appreciate a link or website where I can learn about it.

Another question: if I want one of my features to have more effect on the clustering, is it fine to multiply it by 2 after normalization and then run the clustering?

Thanks.

** What I mean by "get_dummies"?

[image]


Solution

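  • I think that's pretty much it! Use a label encoder or one-hot encoding to convert non-numeric columns into numeric ones. Keep in mind that label encoding imposes an arbitrary order on the categories, which k-means treats as real distances, so one-hot encoding (what get_dummies does) is usually the safer choice for unordered categories.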

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    # creating initial dataframe
    bridge_types = ('Arch','Beam','Truss','Cantilever','Tied Arch','Suspension','Cable')
    bridge_df = pd.DataFrame(bridge_types, columns=['Bridge_Types'])
    # creating instance of labelencoder
    labelencoder = LabelEncoder()
    # assigning numerical values and storing in another column
    bridge_df['Bridge_Types_Cat'] = labelencoder.fit_transform(bridge_df['Bridge_Types'])
    bridge_df
    

    Result:

      Bridge_Types  Bridge_Types_Cat
    0         Arch                 0
    1         Beam                 1
    2        Truss                 6
    3   Cantilever                 3
    4    Tied Arch                 5
    5   Suspension                 4
    6        Cable                 2
    

    Or...

    import pandas as pd
    # creating initial dataframe
    bridge_types = ('Arch','Beam','Truss','Cantilever','Tied Arch','Suspension','Cable')
    bridge_df = pd.DataFrame(bridge_types, columns=['Bridge_Types'])
    # generate binary values using get_dummies
    dum_df = pd.get_dummies(bridge_df, columns=["Bridge_Types"], prefix=["Type_is"])
    # join the dummy columns back onto bridge_df by row index
    bridge_df = bridge_df.join(dum_df)
    bridge_df
    

    Just keep in mind that if you have a lot of distinct labels, your data is going to be pretty sparse after one-hot encoding. Also, yes, you can 'game it' by doubling a feature, either by multiplying it by 2 or by repeating its column. Here is a basic example.

    import numpy as np
    # DF is assumed to be your normalized DataFrame; Feature1 is repeated to give it more weight
    data = np.column_stack([DF['Feature1'], DF['Feature1'], DF['Feature2']])
    

    It seems a little weird, and I've never done that in practice, but it should give that feature more influence on the clusters. Test it and see how you get along. Finally, when you have some free time, read through the scikit-learn clustering documentation at the link below. You will learn a lot from it.

    https://scikit-learn.org/stable/modules/clustering.html
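
As a rough sketch of how the pieces above fit together (one-hot encoding, normalization, extra weight on one feature, then k-means): the building data, column names, and weight below are made up for illustration and are not taken from the question. One thing worth knowing is that k-means works on squared Euclidean distances, so multiplying a normalized column by w scales its contribution by w squared, and repeating a column once is equivalent to multiplying it by sqrt(2). The last few lines check that equivalence numerically.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; names and values are invented for illustration only.
buildings = pd.DataFrame({
    "energy_kwh": [120.0, 135.0, 560.0, 610.0, 125.0, 590.0],
    "size_m2": [80.0, 95.0, 400.0, 450.0, 85.0, 420.0],
    "type": ["house", "house", "office", "office", "house", "office"],
})
numeric_cols = ["energy_kwh", "size_m2"]

# Standardize the numeric columns so neither dominates because of its units.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(buildings[numeric_cols]),
    columns=numeric_cols,
    index=buildings.index,
)

# One-hot encode the unordered categorical column.
dummies = pd.get_dummies(buildings["type"], prefix="type", dtype=float)

# Combine into one numeric feature matrix.
features = scaled.join(dummies)

# Give one feature more influence by scaling it AFTER normalization.
# The weight of 2.0 is an arbitrary choice for this sketch.
weight = 2.0
features["energy_kwh"] *= weight

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
buildings["cluster"] = kmeans.fit_predict(features)
print(buildings)

# Check: repeating a column gives the same distances as scaling it by sqrt(2).
a = np.array([[1.0, 5.0], [3.0, 9.0]])
dup = np.column_stack([a[:, 0], a[:, 0], a[:, 1]])
scl = np.column_stack([np.sqrt(2) * a[:, 0], a[:, 1]])
d_dup = np.linalg.norm(dup[0] - dup[1])
d_scaled = np.linalg.norm(scl[0] - scl[1])
print(np.isclose(d_dup, d_scaled))
```

With real data you would pick the number of clusters and the weight by trying a few values and inspecting the resulting groups, since there is no single correct weight.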