Problems plotting a scatterplot from the SVD of a dictionary of frequent words in Python


I am probably making a lot of mistakes here.

I've created this dictionary to organize the frequent words extracted from two pieces of text.

dims = {('1', 2,0),('beam',4,2),('electron',3,7),('electron-beam', 2,0),('focused',0,2),('generation', 2,0),('relativistic',0,2),('requirements',3,0),('sample', 0,2),('stringent', 2,0),('ultrafast', 0,5)}

The first word is '1': it appears 2 times in text1 and 0 times in text2. The second word is 'beam': it appears 4 times in text1 and 2 times in text2, and so on.
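As a side note, the literal above is a Python set of tuples rather than a dict, so its iteration order is not guaranteed. A minimal sketch of one way to pull the words and counts out in a fixed order follows; the name `rows` is introduced here for illustration only, and sorting by word is an assumption about the desired order.

```python
# Same data as in the question (a set of tuples, so iteration order is arbitrary).
dims = {('1', 2, 0), ('beam', 4, 2), ('electron', 3, 7), ('electron-beam', 2, 0),
        ('focused', 0, 2), ('generation', 2, 0), ('relativistic', 0, 2),
        ('requirements', 3, 0), ('sample', 0, 2), ('stringent', 2, 0),
        ('ultrafast', 0, 5)}

# Sort by word so that labels and counts always line up in the same order.
rows = sorted(dims)
words = [word for word, _, _ in rows]
dinums = [[n1, n2] for _, n1, n2 in rows]

print(words[:3])   # first three words after sorting
print(dinums[:3])  # their counts in text1 and text2
```

Keeping `words` and `dinums` aligned this way means row i of any later result still refers to `words[i]`.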

I want to get a scatterplot of the SVD (singular value decomposition) for the words in the two texts. The final result should be something like this:

[image: example of the desired scatterplot]

I could not find a solution, so I started breaking the task into steps.

First, I created an array with only the numbers:

dinums = [[2,0],[4,2],[3,7],[2,0],[0,2],[2,0],[0,2],[3,0],[0,2],[2,0],[0,5]]

I then managed to compute an SVD that made sense:

import scipy.linalg

U, s, Vh = scipy.linalg.svd(dinums)

This is the dataframe with the results:

import pandas as pd

svd_df = pd.DataFrame(U[:,0:2])
print(svd_df)
        0         1
0  -0.090641  0.300286
1  -0.353900  0.442893
2  -0.740124 -0.101449
3  -0.090641  0.300286
4  -0.172618 -0.157679
5  -0.090641  0.300286
6  -0.172618 -0.157679
7  -0.135962  0.450429
8  -0.172618 -0.157679
9  -0.090641  0.300286
10 -0.431545 -0.394199
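For reference, here is a minimal sketch of the same decomposition with NumPy instead of SciPy, which is a substitution and not what the question uses. It assumes the count matrix from above. `full_matrices=False` asks for the "thin" SVD, so `U` already has only two columns. The name `coords` and the scaling by the singular values are optional choices for illustration, not something the question requires.

```python
import numpy as np

dinums = [[2, 0], [4, 2], [3, 7], [2, 0], [0, 2], [2, 0],
          [0, 2], [3, 0], [0, 2], [2, 0], [0, 5]]
a = np.array(dinums, dtype=float)

# Thin SVD: U is 11x2 instead of 11x11, s holds the 2 singular values.
U, s, Vh = np.linalg.svd(a, full_matrices=False)

# Optionally scale each column of U by its singular value
# to get word coordinates along the two singular directions.
coords = U * s

# Sanity check: the three factors reproduce the original count matrix.
assert np.allclose(U @ np.diag(s) @ Vh, a)
print(U.shape, s.shape, Vh.shape)
```

Note that the signs of the columns of `U` are not unique, so the values may come out mirrored relative to the table above.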

I renamed the columns, to use them in the scatterplot:

svd_df = svd_df.rename(columns={0:'Abstr_1', 1:'Abstr_2'})

My attempt at plotting the scatterplot (big fail!!!):

import seaborn as sns
import matplotlib.pyplot as plt

color_dict = dict({'Abstr_1':'Orange',
                   'Abstr_2':'Grey'})
# Scatter plot: SV1 and SV2
sns.scatterplot(x="Abstr_1", y="Abstr_2", 
                palette=color_dict, 
                data=svd_df, s=100,
                alpha=0.7)
plt.xlabel('Abstract 1:'.format(s), fontsize=10)
plt.ylabel('Abstract 2:'.format(s), fontsize=10)

[image: the scatterplot produced by this code]


Solution

  • Does this help you get closer to what you have in mind?

    Remarks:

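    • Your data frame contains duplicated entries (e.g. indexes 0, 3, 5, 9) after running SVD. This looks like a consequence of the input rather than of the SVD itself: the corresponding rows of dinums are all [2, 0], and identical input rows are mapped to identical points. Is this intended? If not, you may want to revisit how the word counts are built.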

    • color_dict = dict({'Abstr_1':'Orange','Abstr_2':'Grey'}) is confusing: each (x, y) pair forms a single point, so are you trying to color each point half orange and half grey? If you want points in different colors, you can add a grouping column (e.g. from clustering) and pass it as hue, as I do in the code below.

    • You can use .text to label your points. Note that some words overlap because of the duplicated data frame entries.
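To see where the duplicated points come from, a quick check on the input counts may help. This is a minimal sketch that assumes the counts from the question are loaded into a DataFrame; the names `counts`, `text1`, `text2` and `dupes` are introduced here for illustration only.

```python
import pandas as pd

words = ['1', 'beam', 'electron', 'electron-beam', 'focused', 'generation',
         'relativistic', 'requirements', 'sample', 'stringent', 'ultrafast']
dinums = [[2, 0], [4, 2], [3, 7], [2, 0], [0, 2], [2, 0],
          [0, 2], [3, 0], [0, 2], [2, 0], [0, 5]]
counts = pd.DataFrame(dinums, index=words, columns=['text1', 'text2'])

# keep=False marks every member of a duplicated group, not only the repeats.
dupes = counts[counts.duplicated(keep=False)]
print(dupes.sort_values(['text1', 'text2']))
```

Any words listed here have exactly the same counts in both texts, so they will land on the same spot in the plot.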

    Code

    # your word list
    words = ['one','beam','electron','electron-beam','focused',
             'generation','relativistic','requirements','sample','stringent','ultrafast']
    
    # randomly assigning clusters for the purpose of demonstration
    svd_df['Cluster'] = [0,1,0,1,1,0,0,1,0,1,0]
    
    # at this stage your data frame looks like this
    print(svd_df)
    #      Abstr_1   Abstr_2  Cluster
    # 0  -0.090641  0.300286        0
    # 1  -0.353900  0.442893        1
    # 2  -0.740124 -0.101449        0
    # 3  -0.090641  0.300286        1
    # 4  -0.172618 -0.157679        1
    # 5  -0.090641  0.300286        0
    # 6  -0.172618 -0.157679        0
    # 7  -0.135962  0.450429        1
    # 8  -0.172618 -0.157679        0
    # 9  -0.090641  0.300286        1
    # 10 -0.431545 -0.394199        0
    
    color_list = ['Orange','Grey']
    
    # Scatter plot: SV1 and SV2
    g = sns.scatterplot(x='Abstr_1', y='Abstr_2', hue='Cluster',
                    palette=color_list, 
                    data=svd_df,s=100,
                    alpha=0.7)
    
    for x,y,z in zip(svd_df['Abstr_1'],svd_df['Abstr_2'],words):
        g.text(x,y+0.01,z)
    
    plt.xlabel('Abstract 1', fontsize=10)
    plt.ylabel('Abstract 2', fontsize=10)
    

    Output: [image: scatterplot with labeled points colored by cluster]
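One possible way to deal with the overlapping labels is to merge the words that share a point into a single label before calling .text. The sketch below uses only pandas and the values printed above. The `word` column, the `rounded` and `labels` names, and the rounding to 6 decimals are assumptions made for illustration.

```python
import pandas as pd

words = ['one', 'beam', 'electron', 'electron-beam', 'focused', 'generation',
         'relativistic', 'requirements', 'sample', 'stringent', 'ultrafast']
svd_df = pd.DataFrame({
    'Abstr_1': [-0.090641, -0.353900, -0.740124, -0.090641, -0.172618, -0.090641,
                -0.172618, -0.135962, -0.172618, -0.090641, -0.431545],
    'Abstr_2': [0.300286, 0.442893, -0.101449, 0.300286, -0.157679, 0.300286,
                -0.157679, 0.450429, -0.157679, 0.300286, -0.394199],
})
svd_df['word'] = words

# Round first so that tiny floating-point differences do not split a group.
rounded = svd_df.round({'Abstr_1': 6, 'Abstr_2': 6})

# One row per distinct point, with all of its words joined into one label.
labels = (rounded.groupby(['Abstr_1', 'Abstr_2'])['word']
                 .agg(', '.join)
                 .reset_index())
print(labels)
```

Each row of `labels` can then be passed to g.text(x, y + 0.01, label) in place of the per-word loop, so every point gets one combined label.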