I am probably making a lot of mistakes here.
I've created this collection (strictly a set of tuples rather than a dictionary) to organize the frequent words extracted from two pieces of text.
dims = {('1', 2,0),('beam',4,2),('electron',3,7),('electron-beam', 2,0),('focused',0,2),('generation', 2,0),('relativistic',0,2),('requirements',3,0),('sample', 0,2),('stringent', 2,0),('ultrafast', 0,5)}
The first word is '1'; it appears 2 times in text1 and 0 times in text2. The second word is 'beam'; it appears 4 times in text1 and 2 times in text2, and so on.
I want to get a scatterplot of the SVD (singular value decomposition) for the words in the two texts. The final result should be something like this:
I could not find a solution, so I started to break the task down,
beginning with an array that holds the numbers only:
dinums = [[2,0],[4,2],[3,7],[2,0],[0,2],[2,0],[0,2],[3,0],[0,2],[2,0],[0,5]]
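For what it's worth, the array does not have to be typed by hand. Below is a minimal sketch, assuming the tuples keep the (word, count in text1, count in text2) layout described above. The names rows and words are introduced here for illustration only. Because a set has no guaranteed order, the tuples are sorted by word first so that the row order is reproducible.

```python
# The tuples from the question: (word, count in text1, count in text2)
dims = {('1', 2, 0), ('beam', 4, 2), ('electron', 3, 7), ('electron-beam', 2, 0),
        ('focused', 0, 2), ('generation', 2, 0), ('relativistic', 0, 2),
        ('requirements', 3, 0), ('sample', 0, 2), ('stringent', 2, 0),
        ('ultrafast', 0, 5)}

# A set is unordered, so sort by word to get a reproducible row order
rows = sorted(dims)

# Split each tuple into its label and its pair of counts
words = [word for word, _, _ in rows]
dinums = [[n1, n2] for _, n1, n2 in rows]

print(words)
print(dinums)
```

Keeping words alongside dinums means row i of the SVD output can later be matched back to its word.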
From that array I managed to compute an SVD that made sense:
import numpy as np
from scipy import linalg
a = np.array(dinums)  # 11 x 2 matrix of word counts
U, s, Vh = linalg.svd(a)
This is the dataframe with the results:
import pandas as pd
svd_df = pd.DataFrame(U[:,0:2])  # keep the first two left singular vectors
print(svd_df)
           0         1
0  -0.090641  0.300286
1  -0.353900  0.442893
2  -0.740124 -0.101449
3  -0.090641  0.300286
4  -0.172618 -0.157679
5  -0.090641  0.300286
6  -0.172618 -0.157679
7  -0.135962  0.450429
8  -0.172618 -0.157679
9  -0.090641  0.300286
10 -0.431545 -0.394199
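A quick way to check that a decomposition like this makes sense is to multiply the factors back together. The sketch below is only an illustration and assumes NumPy is available; it uses np.linalg.svd with full_matrices=False (the "thin" SVD), which returns just the two columns of U that the data frame keeps. The name reconstructed is introduced here. Note that the sign of each singular vector is not unique, so the numbers may come out with flipped signs compared with the table above.

```python
import numpy as np

# Count matrix from the question: one row per word, columns = (text1, text2)
a = np.array([[2, 0], [4, 2], [3, 7], [2, 0], [0, 2], [2, 0],
              [0, 2], [3, 0], [0, 2], [2, 0], [0, 5]])

# Thin SVD: U is 11x2, s holds 2 singular values, Vh is 2x2
U, s, Vh = np.linalg.svd(a, full_matrices=False)

# Scaling the columns of U by s and multiplying by Vh should give back a
reconstructed = (U * s) @ Vh
print(np.allclose(reconstructed, a))

# Rows 0 and 3 of a are both [2, 0], so their rows in U should coincide
print(np.allclose(U[0], U[3]))
```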
I renamed the columns, to use them in the scatterplot:
svd_df = svd_df.rename(columns={0:'Abstr_1', 1:'Abstr_2'})
My attempt at the scatterplot (big fail!!!):
import seaborn as sns
import matplotlib.pyplot as plt
color_dict = dict({'Abstr_1':'Orange',
'Abstr_2':'Grey'})
# Scatter plot: SV1 and SV2
sns.scatterplot(x="Abstr_1", y="Abstr_2",
palette=color_dict,
data=svd_df, s=100,
alpha=0.7)
plt.xlabel('Abstract 1:'.format(s), fontsize=10)
plt.ylabel('Abstract 2:'.format(s), fontsize=10)
Does this help you in getting closer to what you have in mind?
Remarks:
Your data frame contains duplicated entries after running SVD (e.g. indexes 0, 3, 5, 9). The corresponding input rows are identical (all [2,0]), so they end up at the same point. Is this intended? If not, try different model parameters to improve the outputs.
color_dict = dict({'Abstr_1':'Orange','Abstr_2':'Grey'})
is confusing: Abstr_1 and Abstr_2 are the x and y coordinates of a single point, not two groups of points. Are you trying to plot each point half orange and half grey? If you want varying colors for points, you can use clustering (grouping) like I do.
You can use .text to label your points. Note that some words overlap because of the duplicated data frame entries.
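One possible way around the overlapping labels is to merge the words that share a point into a single label before drawing the text. This is only a sketch: merge_labels is a hypothetical helper written for this answer, not part of seaborn or matplotlib, and the sample coordinates are the first four rows of the printed data frame.

```python
from collections import defaultdict

def merge_labels(xs, ys, labels, ndigits=6):
    """Join the labels of points that share the same (rounded) coordinates."""
    groups = defaultdict(list)
    for x, y, label in zip(xs, ys, labels):
        # Round so that tiny floating-point differences do not split a group
        groups[(round(x, ndigits), round(y, ndigits))].append(label)
    return {point: ", ".join(names) for point, names in groups.items()}

# First four rows of the printed data frame, with their words
xs = [-0.090641, -0.353900, -0.740124, -0.090641]
ys = [0.300286, 0.442893, -0.101449, 0.300286]
labels = ['1', 'beam', 'electron', 'electron-beam']

merged = merge_labels(xs, ys, labels)
for (x, y), text in merged.items():
    print(x, y, text)
```

Each (x, y) key and its merged text can then be passed to .text in place of the individual words, so every point gets exactly one label.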
Code
# your word list
words = ['1','beam','electron','electron-beam','focused',
'generation','relativistic','requirements','sample','stringent','ultrafast']
# randomly assigning clusters for the purpose of demonstration
svd_df['Cluster'] = [0,1,0,1,1,0,0,1,0,1,0]
# at this stage your data frame looks like this
print(svd_df)
# Abstr_1 Abstr_2 Cluster
# 0 -0.090641 0.300286 0
# 1 -0.353900 0.442893 1
# 2 -0.740124 -0.101449 0
# 3 -0.090641 0.300286 1
# 4 -0.172618 -0.157679 1
# 5 -0.090641 0.300286 0
# 6 -0.172618 -0.157679 0
# 7 -0.135962 0.450429 1
# 8 -0.172618 -0.157679 0
# 9 -0.090641 0.300286 1
# 10 -0.431545 -0.394199 0
color_list = ['Orange','Grey']
# Scatter plot: SV1 and SV2
g = sns.scatterplot(x='Abstr_1', y='Abstr_2', hue='Cluster',
palette=color_list,
data=svd_df,s=100,
alpha=0.7)
# label each point with its word, nudged slightly above the marker
for x,y,z in zip(svd_df['Abstr_1'],svd_df['Abstr_2'],words):
    g.text(x,y+0.01,z)
plt.xlabel('Abstract 1', fontsize=10)
plt.ylabel('Abstract 2', fontsize=10)