I calculated the cosine similarity of a dataframe similar to the following:
ciiu4n4 A0111 A0112 A0113
A0111 14 7 6
A0112 16 55 3
A0113 15 0 112
using this code:
data_cosine = mpg_data.drop(['ciiu4n4'], axis=1)
result = cosine_similarity(data_cosine)
I get as a result an array like this:
[[ 1. 0.95357118 0.95814892 ]
[ 0.95357118 1. 0.89993795 ]
[ 0.95814892 0.89993795 1. ]]
However, I need the result as a dataframe similar to the original one. I can't do it manually, because the original dataframe is 600 x 600.
The result I need should look something like this:
ciiu4n4 A0111 A0112 A0113
A0111 1. 0.95357118 0.95814892
A0112 0.95357118 1. 0.89993795
A0113 0.95814892 0.89993795 1.
I'd recommend changing your approach slightly. No need to drop any columns. Instead, set the first column as the index, compute cosine similarities, and assign the result array back to the dataframe.
df = df.set_index('ciiu4n4')
df
A0111 A0112 A0113
ciiu4n4
A0111 14 7 6
A0112 16 55 3
A0113 15 0 112
v = cosine_similarity(df.values)
df[:] = v
df.reset_index()
ciiu4n4 A0111 A0112 A0113
0 A0111 1.000000 0.953571 0.958149
1 A0112 0.953571 1.000000 0.899938
2 A0113 0.958149 0.899938 1.000000
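The solution above only works when the dataframe is square, i.e. the number of rows equals the number of columns (excluding the first), because cosine_similarity returns an array of shape (n_rows, n_rows) that has to fit back into the dataframe. So, here's another solution that should generalise to any shape.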
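import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity  # assuming scikit-learn's implementation

df = df.set_index('ciiu4n4')
v = cosine_similarity(df.values)
# Label both axes with the original row labels, then move them back into a column
df = pd.DataFrame(v, columns=df.index.values, index=df.index).reset_index()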
df
ciiu4n4 A0111 A0112 A0113
0 A0111 1.000000 0.953571 0.958149
1 A0112 0.953571 1.000000 0.899938
2 A0113 0.958149 0.899938 1.000000
Or, using df.insert:
df = pd.DataFrame(v, columns=df.index.values)
# The new frame has a default 0..n-1 index, so take the labels from its columns
df.insert(0, 'ciiu4n4', df.columns)
df
ciiu4n4 A0111 A0112 A0113
0 A0111 1.000000 0.953571 0.958149
1 A0112 0.953571 1.000000 0.899938
2 A0113 0.958149 0.899938 1.000000
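To illustrate the shape point above, here is a minimal sketch with a non-square input (3 rows, 2 feature columns), where assigning the 3 x 3 result back with df[:] = v would not fit. The helper name cosine_similarity_frame and the sample data are hypothetical, and the similarity is computed directly with NumPy rather than with the cosine_similarity call used above, so the snippet only needs numpy and pandas. It assumes no row is all zeros.

```python
import numpy as np
import pandas as pd


def cosine_similarity_frame(df, key):
    """Return row-by-row cosine similarities as a DataFrame labelled by `key`.

    Hypothetical helper; assumes every row has a non-zero norm.
    """
    indexed = df.set_index(key)
    x = indexed.to_numpy(dtype=float)
    # Scale each row to unit length; dot products of unit rows
    # are the cosine similarities.
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = unit @ unit.T  # shape (n_rows, n_rows), whatever the column count
    out = pd.DataFrame(sim, index=indexed.index, columns=indexed.index.to_numpy())
    return out.reset_index()


# Hypothetical non-square input: 3 rows but only 2 feature columns.
sample = pd.DataFrame({
    "ciiu4n4": ["A0111", "A0112", "A0113"],
    "x": [1, 0, 1],
    "y": [0, 1, 1],
})
result = cosine_similarity_frame(sample, "ciiu4n4")
print(result)
```

The result has one row and one column per original row label, plus the ciiu4n4 column in front, independent of how many feature columns the input had.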