I'm fairly new to clustering and a bit confused about which method to use. I have a set of buildings that I want to cluster according to their energy consumption, size, type, and neighborhood. I used k-means, and I used the pandas "get_dummies" function to handle my categorical data.
Is this a correct way to deal with categorical data? (I also tried simply mapping the categories to numbers like 1, 2, 3, etc. and normalizing them before clustering, but didn't get suitable results.) If you suggest another algorithm (random forest, SVM, or anything else), I would appreciate a link or website where I can learn it.
Another question: if I want one of my features to have more effect on the clustering, is it fine to multiply it by 2 after normalization and then run the clustering?
Thanks.
** What I mean by "get_dummies"?
I think that's pretty much it! Use label encoding or one-hot encoding to convert non-numeric values into numbers.
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
# creating initial dataframe
bridge_types = ('Arch','Beam','Truss','Cantilever','Tied Arch','Suspension','Cable')
bridge_df = pd.DataFrame(bridge_types, columns=['Bridge_Types'])
# creating instance of labelencoder
labelencoder = LabelEncoder()
# Assigning numerical values and storing in another column
bridge_df['Bridge_Types_Cat'] = labelencoder.fit_transform(bridge_df['Bridge_Types'])
bridge_df
Result:
  Bridge_Types  Bridge_Types_Cat
0         Arch                 0
1         Beam                 1
2        Truss                 6
3   Cantilever                 3
4    Tied Arch                 5
5   Suspension                 4
6        Cable                 2
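One thing to watch with label encoding and a distance-based method like k-means: the integer codes impose an order that the categories don't really have, which may be why mapping to 1, 2, 3 didn't work well for you. Here is a small sketch of that effect using the same bridge types (the helper `onehot_dist` and the variable names are just mine, for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

bridge_types = ['Arch', 'Beam', 'Truss', 'Cantilever', 'Tied Arch', 'Suspension', 'Cable']

# Label encoding: categories become integers in alphabetical order
codes = LabelEncoder().fit_transform(bridge_types)
code_of = dict(zip(bridge_types, codes))

# With label codes, the "distance" between two categories is the gap between their integers
label_dist_arch_beam = abs(int(code_of['Arch']) - int(code_of['Beam']))
label_dist_arch_truss = abs(int(code_of['Arch']) - int(code_of['Truss']))

# One-hot encoding: one 0/1 column per category, one row per original entry
onehot = pd.get_dummies(pd.Series(bridge_types)).astype(float)

def onehot_dist(a, b):
    """Euclidean distance between the one-hot rows of categories a and b."""
    row_a = onehot.iloc[bridge_types.index(a)].to_numpy()
    row_b = onehot.iloc[bridge_types.index(b)].to_numpy()
    return float(np.linalg.norm(row_a - row_b))

print(label_dist_arch_beam, label_dist_arch_truss)
print(onehot_dist('Arch', 'Beam'), onehot_dist('Arch', 'Truss'))
```

With label codes, Arch ends up much "closer" to Beam than to Truss purely because of alphabetical order, while with one-hot columns every pair of distinct types is the same distance apart.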
Or...
import pandas as pd
import numpy as np
# creating initial dataframe
bridge_types = ('Arch','Beam','Truss','Cantilever','Tied Arch','Suspension','Cable')
bridge_df = pd.DataFrame(bridge_types, columns=['Bridge_Types'])
# generate binary values using get_dummies
dum_df = pd.get_dummies(bridge_df, columns=["Bridge_Types"], prefix=["Type_is"] )
# merge with main df bridge_df on key values
bridge_df = bridge_df.join(dum_df)
bridge_df
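To tie this back to the buildings in the question, here is a rough sketch of how the pieces might fit together: scale the numeric columns, one-hot encode the categorical ones, and run k-means on the combined table. The column names (`energy_kwh`, `size_m2`, `building_type`, `neighborhood`), the values, and the choice of two clusters are all made up for illustration; your real data will differ.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical building data; names and values are invented for this example
buildings = pd.DataFrame({
    'energy_kwh': [120.0, 135.0, 900.0, 950.0, 130.0, 910.0],
    'size_m2': [80.0, 95.0, 600.0, 640.0, 90.0, 620.0],
    'building_type': ['house', 'house', 'office', 'office', 'house', 'office'],
    'neighborhood': ['north', 'north', 'south', 'south', 'north', 'south'],
})

numeric_cols = ['energy_kwh', 'size_m2']
categorical_cols = ['building_type', 'neighborhood']

# Standardize numeric columns so that neither dominates the Euclidean distance
scaled = pd.DataFrame(
    StandardScaler().fit_transform(buildings[numeric_cols]),
    columns=numeric_cols,
    index=buildings.index,
)

# One-hot encode the categorical columns (one 0/1 column per category)
dummies = pd.get_dummies(buildings[categorical_cols]).astype(float)

# Combine both parts into a single all-numeric feature table
features = pd.concat([scaled, dummies], axis=1)

# Cluster; n_clusters=2 is an arbitrary choice for this toy data
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
buildings['cluster'] = kmeans.fit_predict(features)
print(buildings)
```

On real data you would want to try several values of `n_clusters` and check whether the resulting groups make sense for your buildings.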
Just keep in mind that if you have a lot of distinct labels, your data is going to be pretty sparse after you one-hot encode everything. Also, yes, you can 'game it' by doubling a feature. Here is a basic example.
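import numpy as np
# DF is your (already normalized) DataFrame; repeating Feature1 gives it more weight.
# column_stack keeps one row per sample, which is the layout scikit-learn expects.
data = np.column_stack([DF['Feature1'], DF['Feature1'], DF['Feature2']])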
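As a side note on the multiply-by-2 idea from the question: for Euclidean distance, which is what k-means uses, duplicating a column has the same effect as multiplying it by the square root of 2, and multiplying by 2 quadruples that feature's contribution to the squared distance. A minimal sketch with random stand-in data (the arrays and the helper `pairwise_distances` are hypothetical, just to show the equivalence):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical, already-normalized features (one value per sample)
feature1 = rng.normal(size=5)
feature2 = rng.normal(size=5)

# Option A: include Feature1 twice
duplicated = np.column_stack([feature1, feature1, feature2])

# Option B: multiply Feature1 by a weight instead
weight = np.sqrt(2.0)
weighted = np.column_stack([weight * feature1, feature2])

def pairwise_distances(x):
    """Euclidean distance between every pair of rows in x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Both layouts give the same distances between samples
same_geometry = np.allclose(pairwise_distances(duplicated), pairwise_distances(weighted))
print(same_geometry)
```

So a weight is just a more flexible version of the same trick: you can pick any factor rather than being limited to whole copies of a column.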
It seems a little weird, and I've never done that in practice, but it should give you the result you want. Test it and see how you get along. Finally, when you have some free time, read through the material at the link below. You will learn a lot from it.