Dendrogram with Python - 4 categories


Hello experts,

As a Python beginner, I want to create a dendrogram from the following data:

import pandas as pd

data = pd.DataFrame([['Apple, livingroom, worker', 200], ['Strawberry, bedroom, student', 100],
                     ['Apple, bedroom, child', 150], ['Strawberry, toilet, student', 100]], columns = ['Text', 'Costs'])

This is only an example dataset; my real dataset is much longer, but the structure is the same.

The dataset looks like this:

Out[89]: 
                           Text  Costs
0     Apple, livingroom, worker    200
1  Strawberry, bedroom, student    100
2         Apple, bedroom, child    150
3   Strawberry, toilet, student    100

My steps were as follows.

ONE: I used TfidfVectorizer to convert my text column into numeric values so that I can create a dendrogram. Is there any other option?

So I did the following:

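from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(data['Text']).toarray()  # dense NumPy array
vocab = tfidf.vocabulary_
new_cols = tfidf.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2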

df = data.drop('Text',axis=1)
df = df.join(pd.DataFrame(tfidf_matrix, columns=new_cols))

And my output is:

Out[92]: 
   Costs     apple   bedroom  ...   student    toilet    worker
0    200  0.486934  0.000000  ...  0.000000  0.000000  0.617614
1    100  0.000000  0.577350  ...  0.577350  0.000000  0.000000
2    150  0.526405  0.526405  ...  0.000000  0.000000  0.000000
3    100  0.000000  0.000000  ...  0.526405  0.667679  0.000000

[4 rows x 9 columns]
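
On the question of other options: one possible alternative is to split the text into 0/1 indicator columns instead of TF-IDF weights. This is a minimal sketch, assuming every Text entry is a list of category values separated by ', '. The names `dummies` and `encoded` are illustrative and not part of the original code.

```python
import pandas as pd

data = pd.DataFrame(
    [['Apple, livingroom, worker', 200],
     ['Strawberry, bedroom, student', 100],
     ['Apple, bedroom, child', 150],
     ['Strawberry, toilet, student', 100]],
    columns=['Text', 'Costs'],
)

# Split each ', '-separated string into one 0/1 indicator column per value.
dummies = data['Text'].str.get_dummies(sep=', ')

# Same layout as the TF-IDF frame: Costs first, then one column per value.
encoded = data[['Costs']].join(dummies)
print(encoded)
```

Unlike TF-IDF, this keeps the original capitalisation and gives every category value the same weight; whether that suits the real dataset is a judgment call.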

TWO: Next I wanted to create a dendrogram and see the labels. What I wanted and expected was a dendrogram with only the labels 200, 100 and 150. It would be fine if a value such as 100 appears more than once on the x-axis. So I wrote:

from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

linked = linkage(df, 'ward')

# Create the dendrogram and its labels
labels = df.columns
p = len(labels)

plt.figure(figsize=(8,4))
plt.title('Hierarchical Clustering Dendrogram (truncated)', fontsize=20)
plt.xlabel('Look at my fancy labels!', fontsize=16)
plt.ylabel('distance', fontsize=16)

# Call dendrogram without plotting to get the result dict
R = dendrogram(
                linked,
                truncate_mode='lastp',  # show only the last p merged clusters
                p=p,  # show only the last p merged clusters
                no_plot=True,
                )

# Label dict
temp = {R["leaves"][ii]: labels[ii] for ii in range(len(R["leaves"]))}
def llf(xx):
    return "{}".format(temp[xx])

dendrogram(
            linked,
            truncate_mode='lastp',  # show only the last p merged clusters
            p=p,  # show only the last p merged clusters
            leaf_label_func=llf,
            leaf_rotation=60.,
            leaf_font_size=12.,
            show_contracted=True,  # to get a distribution impression in truncated branches
            )
plt.show()

and my dendrogram looks like this:

[image: dendrogram produced by the code above]

Instead of "Costs" etc., I expected the labels "200", "100", "150", "100". It should read like a decision tree, from top to bottom through the dendrogram. For example: if I have an apple, a living room, and I am a worker, my costs are 200. Or: if I have an apple, a bedroom, and I am a child, my costs are 150.

Can anyone help me display something like this?


Solution

  • To get the desired output, replace:

    labels = df.columns
    

    with

    labels = df.Costs
    

    To add a legend for the colors, replace:

    plt.show()
    

    with

    plt.legend(df.columns)
    plt.show()
    

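
One caveat worth checking: the label dictionary in the question pairs the i-th plotted leaf with the i-th label, so the labels follow plot order rather than the row each leaf represents. The sketch below is one possible way around that, not the original answer's method. It passes the costs through the `labels` parameter of `dendrogram`, which looks labels up by row number. It assumes no truncation is needed; `cost_labels` and `features` are illustrative names.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

data = pd.DataFrame(
    [['Apple, livingroom, worker', 200],
     ['Strawberry, bedroom, student', 100],
     ['Apple, bedroom, child', 150],
     ['Strawberry, toilet, student', 100]],
    columns=['Text', 'Costs'],
)

# Same features as in the question: Costs plus the TF-IDF columns.
tfidf_matrix = TfidfVectorizer().fit_transform(data['Text']).toarray()
features = np.column_stack([data['Costs'].to_numpy(), tfidf_matrix])

linked = linkage(features, 'ward')

# One label per row, in row order; dendrogram() picks the label by leaf id.
cost_labels = data['Costs'].astype(str).tolist()

# no_plot=True only computes the layout, which makes the mapping easy to check.
R = dendrogram(linked, labels=cost_labels, no_plot=True)

# R['leaves'] lists row numbers left to right; R['ivl'] the labels shown there.
for row, label in zip(R['leaves'], R['ivl']):
    print(row, label, data.loc[row, 'Text'])
```

To draw the tree, drop `no_plot=True`, optionally pass `leaf_rotation`, and call `plt.show()` after importing `matplotlib.pyplot as plt`.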