Tags: python, numpy, similarity, cosine-similarity

Randomly generate similar vectors?


I have the following vector:

import numpy as np
my_vector = np.array([0.001, -0.05, 0.3, 0.5, 0.01, -0.03])

Could someone suggest a way to randomly generate similar vectors, with just slightly different values? The desired output would be, for instance:

[0.002, -0.06, 0.2, 0.4, 0.02, -0.02]

To give some context, this vector represents a sample that I feed into a classification model. My plan is to randomly generate a set of similar samples and feed them into the same model to observe the variation in its output. The end goal is to verify whether the model generates similar outputs for similar samples.

I tried the approach from "Create random vector given cosine similarity", setting my desired cosine similarity to 1, but with this method I can only obtain one similar vector (see below), and I need at least 10.

def rand_cos_sim(v, costheta):
    # Form the unit vector parallel to v:
    u = v / np.linalg.norm(v)

    # Pick a random vector:
    r = np.random.multivariate_normal(np.zeros_like(v), np.eye(len(v)))

    # Form a vector perpendicular to v:
    uperp = r - r.dot(u)*u

    # Make it a unit vector:
    uperp = uperp / np.linalg.norm(uperp)

    # w is the linear combination of u and uperp with coefficients costheta
    # and sin(theta) = sqrt(1 - costheta**2), respectively:
    w = costheta*u + np.sqrt(1 - costheta**2)*uperp

    return w


new_vector = rand_cos_sim(my_vector, 1)
print(new_vector)

# [ 0.00170622 -0.08531119  0.51186714  0.8531119   0.01706224 -0.05118671]

I do not have a particular similarity measure in mind; it could be Euclidean, cosine, or whichever works best. Any suggestions are most welcome.

Please note that the my_vector I provided is for illustration purposes only. In reality, my vectors will have different ranges of values depending on the model I am testing and the data.

Thank you.


Solution

  • You could generate random multiplicative factors by calling numpy.random.lognormal. Use mean=0 and a small value of sigma to generate random values near 1.

    For example,

    In [23]: my_vector = np.array([0.001, -0.05, 0.3, 0.5, 0.01, -0.03])

    In [24]: a = np.random.lognormal(sigma=0.1, size=my_vector.shape)

    In [25]: a
    Out[25]:
    array([1.07162745, 0.99891183, 1.02511718, 0.85346562, 1.04191125,
           0.87158183])

    In [26]: a * my_vector
    Out[26]:
    array([ 0.00107163, -0.04994559,  0.30753516,  0.42673281,  0.01041911,
           -0.02614745])
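
Since the question asks for at least 10 similar vectors, the same idea can be extended to a whole batch by drawing a 2-D array of factors. The following is only a minimal sketch: the helper name `similar_vectors` and its `n`, `sigma`, and `seed` parameters are made up for illustration, and it uses NumPy's newer `default_rng` generator rather than the legacy `np.random.lognormal` call shown above.

```python
import numpy as np

my_vector = np.array([0.001, -0.05, 0.3, 0.5, 0.01, -0.03])


def similar_vectors(v, n=10, sigma=0.1, seed=None):
    """Return an (n, len(v)) array of randomly rescaled copies of v.

    Hypothetical helper for illustration; sigma controls how far the
    multiplicative factors stray from 1.
    """
    rng = np.random.default_rng(seed)
    # One lognormal factor per element and per sample; mean=0 centres
    # the factors around 1 on a log scale.
    factors = rng.lognormal(mean=0.0, sigma=sigma, size=(n, v.size))
    # Broadcasting multiplies every row of factors element-wise by v.
    return factors * v


samples = similar_vectors(my_vector, n=10, sigma=0.1, seed=0)
print(samples.shape)
print(samples[:3])
```

Because lognormal factors are always positive, every generated element keeps the sign of the original, and the perturbation is proportional to each element's magnitude, which may suit vectors whose values span different ranges. An element that is exactly zero stays zero, so additive noise might be a better fit if that matters for your data.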