python functional-programming outer-join levenshtein-distance

Using reduce, map or another function to avoid for loops in Python


I have a working program that calculates pairwise distances and then applies the k-means algorithm. I tested it on a small list and it runs quickly and correctly. However, my real list is very large (more than 5000 strings), so it takes forever and I ended up terminating the run. Can I use outer() or some parallel function with the distance function to make this faster? For the small set that I have:

strings = ['cosine cos', 'cosine', 'cosine???????', 'l1', 'l2', 'manhattan']

The resulting 3D distance array looks like this:

[[[ 0.          0.25        0.47826087  1.          1.          0.89473684]
  [ 0.25        0.          0.36842105  1.          1.          0.86666667]
  [ 0.47826087  0.36842105  0.          1.          1.          0.90909091]
  [ 1.          1.          1.          0.          0.5         1.        ]
  [ 1.          1.          1.          0.5         0.          1.        ]
  [ 0.89473684  0.86666667  0.90909091  1.          1.          0.        ]]]

Each row of the array above holds the distances from one item in the strings list to every item in the list. My approach using for loops is:

import numpy as np
import Levenshtein

strings = ['cosine cos', 'cosine', 'cosine???????', 'l1', 'l2', 'manhattan']

data1 = []

for j in range(len(np.array(list(strings)))):
    for i in range(len(strings)):
        data1.append(1-Levenshtein.ratio(np.array(list(strings))[j], np.array(list(strings))[i]))

#n =(map(Levenshtein.ratio, strings))
#n =(reduce(Levenshtein.ratio, strings))
#print(n)



k=len(strings)
data2=np.asarray(data1)
arr_3d = data2.reshape((1,k,k))
print(arr_3d)

Here arr_3d is the array shown above. How can I use outer() or map() to replace the for loops above? When the list strings is large, it runs for hours without ever producing a result. I appreciate the help. Levenshtein.ratio comes from the python-Levenshtein package.


Solution

  • import numpy as np
    import Levenshtein
    
    strings = ['cosine cos', 'cosine', 'cosine???????', 'l1', 'l2', 'manhattan']
    
    k = len(strings)
    
    # k x k matrix of pairwise distances
    data = np.zeros((k, k))
    
    for i, string1 in enumerate(strings):
        for j, string2 in enumerate(strings):
            data[i][j] = 1 - Levenshtein.ratio(string1, string2)
    
    print(data)
    

    There are no gains to be had from map or reduce here: as @user2357112 mentions, the loops still have to run. However, this version is cleaner and should run faster, since it avoids rebuilding np.array(list(strings)) on every iteration as your code did.
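
One further saving that may help: the matrix printed in the question is symmetric with a zero diagonal, so each unordered pair only needs to be computed once. The sketch below illustrates that idea under some assumptions. The names lcs_length, ratio_distance and pairwise_distances are hypothetical, and ratio_distance is only a pure-Python stand-in (based on the longest common subsequence) so the block runs without the Levenshtein package. On the six sample strings it appears to reproduce the values shown in the question, but it is not the library's implementation and will be much slower than it.

```python
from itertools import combinations


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def ratio_distance(a, b):
    """Stand-in (assumption) for 1 - Levenshtein.ratio(a, b), not the library code."""
    total = len(a) + len(b)
    if total == 0:
        return 0.0
    return 1.0 - 2.0 * lcs_length(a, b) / total


def pairwise_distances(strings, dist=ratio_distance):
    """Build a k x k distance matrix, calling dist once per unordered pair."""
    k = len(strings)
    # Assumes dist(s, s) == 0, so the diagonal is left at 0.0.
    matrix = [[0.0] * k for _ in range(k)]
    for i, j in combinations(range(k), 2):
        d = dist(strings[i], strings[j])
        matrix[i][j] = d
        matrix[j][i] = d  # mirror the value: assumes dist is symmetric
    return matrix


strings = ['cosine cos', 'cosine', 'cosine???????', 'l1', 'l2', 'manhattan']
matrix = pairwise_distances(strings)
for row in matrix:
    print(' '.join('%.8f' % value for value in row))
```

With the real package installed, passing dist=lambda a, b: 1 - Levenshtein.ratio(a, b) should give the same layout. For k strings this makes k(k-1)/2 distance calls instead of k*k, roughly halving the work; the loops themselves still run, and np.asarray(matrix) converts the result to an array if one is needed.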