Experts,
As a Python beginner, I want to create a dendrogram from the following data:
import pandas as pd
data = pd.DataFrame([['Apple, livingroom, worker', 200], ['Strawberry, bedroom, student', 100],
                     ['Apple, bedroom, child', 150], ['Strawberry, toilet, student', 100]], columns=['Text', 'Costs'])
This is only an example dataset; my real one is much longer, but the structure is the same.
The dataset looks like this:
Out[89]:
                           Text  Costs
0     Apple, livingroom, worker    200
1  Strawberry, bedroom, student    100
2         Apple, bedroom, child    150
3   Strawberry, toilet, student    100
My steps were as follows. ONE: I used TfidfVectorizer to turn my text column into numeric values so that I can create a dendrogram. Is there any other option?
So I did the following:
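from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(data['Text']).toarray()  # dense ndarray rather than np.matrix
vocab = tfidf.vocabulary_
new_cols = tfidf.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
df = data.drop('Text', axis=1)
df = df.join(pd.DataFrame(tfidf_matrix, columns=new_cols))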
And my Output is:
Out[92]:
   Costs     apple   bedroom  ...   student    toilet    worker
0    200  0.486934  0.000000  ...  0.000000  0.000000  0.617614
1    100  0.000000  0.577350  ...  0.577350  0.000000  0.000000
2    150  0.526405  0.526405  ...  0.000000  0.000000  0.000000
3    100  0.000000  0.000000  ...  0.526405  0.667679  0.000000
[4 rows x 9 columns]
TWO: Next I wanted to create a dendrogram and see the labels. What I expected was a dendrogram whose only labels are 200, 100 and 150 (that is my goal). It would be fine if a value such as 100 appears more than once on the x-axis. So I wrote:
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
linked = linkage(df, 'ward')
# Create the dendrogram and its labels
labels = df.columns
p = len(labels)
plt.figure(figsize=(8,4))
plt.title('Hierarchical Clustering Dendrogram (truncated)', fontsize=20)
plt.xlabel('Look at my fancy labels!', fontsize=16)
plt.ylabel('distance', fontsize=16)
# Call dendrogram once without plotting to get the result dict
R = dendrogram(
    linked,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=p,  # show only the last p merged clusters
    no_plot=True,
)
# Label dict
temp = {R["leaves"][ii]: labels[ii] for ii in range(len(R["leaves"]))}
def llf(xx):
    return "{}".format(temp[xx])
dendrogram(
    linked,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=p,  # show only the last p merged clusters
    leaf_label_func=llf,
    leaf_rotation=60.,
    leaf_font_size=12.,
    show_contracted=True,  # to get a distribution impression in truncated branches
)
plt.show()
And my dendrogram looks like this:
Instead of "Costs" etc., I expected "200", "100", "150", "100". It should read like a decision tree from top to bottom: for example, if I have an apple, a living room, and I am a worker, my costs are 200. Or: if I have an apple, a bedroom, and I am a child, my costs are 150.
Can anyone help me show something like this?
To get the desired output, replace:
labels = df.columns
with
labels = df.Costs
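One thing to watch: the `temp` dict in the question assigns `labels[ii]` to the leaf at plot position `ii`, so a cost may end up under a different row than the one it belongs to. A minimal sketch of an alternative, assuming no truncation is needed, is to pass the costs through SciPy's own `labels=` argument, which looks labels up by row index. The small `df` below is made-up stand-in data, not the real TF-IDF frame, and `leaf_labels`, `leaf_order` and `tick_labels` are illustrative names.

```python
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage

# Toy feature table standing in for the TF-IDF frame from the question
# (values are invented for illustration; only the layout matters here).
df = pd.DataFrame({
    "Costs": [200, 100, 150, 100],
    "apple": [1.0, 0.0, 1.0, 0.0],
    "bedroom": [0.0, 1.0, 1.0, 0.0],
    "student": [0.0, 1.0, 0.0, 1.0],
})

linked = linkage(df.to_numpy(), method="ward")

# labels[i] is the text shown for observation i, so the order follows df's rows.
leaf_labels = df["Costs"].astype(str).to_numpy()

# no_plot=True only computes the layout; drop it and call plt.show()
# to draw the figure with matplotlib.
R = dendrogram(linked, labels=leaf_labels, no_plot=True)

# R["leaves"] holds row indices in plot order, R["ivl"] the matching labels.
leaf_order = R["leaves"]
tick_labels = R["ivl"]
print(leaf_order, tick_labels)
```

With this approach each tick label is the cost of the row that sits at that leaf, and duplicate values such as 100 simply appear more than once on the x-axis.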
To add a legend for the colors, replace:
plt.show()
with
plt.legend(df.columns)
plt.show()
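As for whether there is another option besides TfidfVectorizer: since every entry in `Text` is a short comma-separated list of categories, one possibility is plain 0/1 indicator columns via pandas' `Series.str.get_dummies`. This is only a sketch on the example data from the question; `indicators` and `features` are illustrative names, and whether indicators or TF-IDF weights suit the real dataset better is something to check on that data.

```python
import pandas as pd

data = pd.DataFrame(
    [["Apple, livingroom, worker", 200],
     ["Strawberry, bedroom, student", 100],
     ["Apple, bedroom, child", 150],
     ["Strawberry, toilet, student", 100]],
    columns=["Text", "Costs"],
)

# One 0/1 column per distinct token; sep must match the delimiter used in Text.
indicators = data["Text"].str.get_dummies(sep=", ")

# Same layout as the TF-IDF frame: Costs first, then one column per token.
features = data[["Costs"]].join(indicators)
print(features)
```

Because `Costs` is on a much larger scale than the 0/1 columns, it will dominate the Ward distances if it stays in the matrix passed to `linkage`; clustering on `indicators` alone and using `Costs` only for the labels is one way around that.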