I have 2 arrays, for example:
import numpy as np
ts1 = np.array([[1,2,3,4,5],[1,2,3,4,5],[1,2,3,4,5],[1,2,3,4,5],[1,2,3,4,5]])
ts2 = np.array([[6,2,7,4,5],[1,2,3,4,5],[3,2,3,8,5],[2,2,3,1,5],[9,8,7,6,5]])
I want to calculate the correlation between each row of ts1 and the corresponding row of ts2, restricted to a window defined by two indices (initial and final element) for each row. These indices are stored in a pandas DataFrame, for example:
import pandas as pd
indices = pd.DataFrame({'in':[0,1,1,2,0],'fin':[4,3,2,3,3]})
I can do that with a simple for loop:
corrIdx = np.zeros((ts1.shape[0],))
for idx in range(ts1.shape[0]):
    corrMatrix = np.corrcoef(ts1[idx,indices['in'][idx]:indices['fin'][idx]],ts2[idx,indices['in'][idx]:indices['fin'][idx]])
    corrIdx[idx] = corrMatrix[0,1].item()
Now I want to vectorize this for loop with numpy.vectorize.
I tried:
def calcCorr(idx,ts1,ts2,indices):
    corrMatrix = np.corrcoef(ts1[idx,indices['in'][idx]:indices['fin'][idx]],ts2[idx,indices['in'][idx]:indices['fin'][idx]])
    return corrMatrix[0,1].item()
myFunct = np.vectorize(calcCorr)
idx = np.arange(ts1.shape[0])
corrIdx = myFunct(idx,ts1,ts2,indices)
but I obtain the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/numpy/lib/function_base.py", line 2163, in __call__
    return self._vectorize_call(func=func, args=vargs)
  File "/usr/lib/python3/dist-packages/numpy/lib/function_base.py", line 2241, in _vectorize_call
    ufunc, otypes = self._get_ufunc_and_otypes(func=func, args=args)
  File "/usr/lib/python3/dist-packages/numpy/lib/function_base.py", line 2201, in _get_ufunc_and_otypes
    outputs = func(*inputs)
  File "<stdin>", line 2, in calcCorr
IndexError: invalid index to scalar variable.
I get the same error even if I use 2 arrays instead of the dataframe. Any suggestions?
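A note on the likely cause: by default np.vectorize treats every argument as something to loop over element by element, so calcCorr receives scalars in place of ts1, ts2 and indices, and indexing a scalar raises the IndexError. One way around this is the excluded parameter of np.vectorize, which passes the listed arguments through untouched. The following is only a sketch under assumptions: the window bounds are plain arrays named ini and fin rather than DataFrame columns, and calc_corr, vcorr and corr_idx are names made up for illustration.

```python
import warnings
import numpy as np

ts1 = np.array([[1,2,3,4,5],[1,2,3,4,5],[1,2,3,4,5],[1,2,3,4,5],[1,2,3,4,5]])
ts2 = np.array([[6,2,7,4,5],[1,2,3,4,5],[3,2,3,8,5],[2,2,3,1,5],[9,8,7,6,5]])
# Assumption: window bounds held in plain arrays instead of a DataFrame.
ini = np.array([0,1,1,2,0])
fin = np.array([4,3,2,3,3])

def calc_corr(idx, ts1, ts2, ini, fin):
    # idx arrives as a scalar; the other arguments arrive whole because they are excluded.
    window = slice(ini[idx], fin[idx])
    return np.corrcoef(ts1[idx, window], ts2[idx, window])[0, 1]

# Positions 1..4 are excluded from broadcasting, so only idx is looped over.
vcorr = np.vectorize(calc_corr, excluded={1, 2, 3, 4}, otypes=[float])

with warnings.catch_warnings():
    # One-element windows (rows 2 and 3 here) make corrcoef warn and return nan.
    warnings.simplefilter("ignore", RuntimeWarning)
    corr_idx = vcorr(np.arange(ts1.shape[0]), ts1, ts2, ini, fin)

print(corr_idx)
```

As in the original loop, the slice ini:fin is half-open, so the element at position fin is not included in the window.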
Update July 10, 2023
I tried to follow the suggestions by D.L, but I still have trouble understanding how to correctly define the function to vectorize. The problem is that the initial and final indices are different for each row of ts1 and ts2. So, given:
ini = np.array([0,1,1,2,0])
fin = np.array([4,3,2,3,3])
I tried:
def my_function(a,b,ini,fin):
    result = np.corrcoef(a[ini:fin], b[ini:fin])[0,1]
    return result
vfunc = np.vectorize(my_function, signature='(n),(n),(n),(n)->()')
r = vfunc(ts1,ts2,ini,fin)
but I have the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/numpy/lib/function_base.py", line 2163, in __call__
    return self._vectorize_call(func=func, args=vargs)
  File "/usr/lib/python3/dist-packages/numpy/lib/function_base.py", line 2237, in _vectorize_call
    res = self._vectorize_call_with_signature(func, args)
  File "/usr/lib/python3/dist-packages/numpy/lib/function_base.py", line 2277, in _vectorize_call_with_signature
    results = func(*(arg[index] for arg in args))
  File "<stdin>", line 2, in my_function
TypeError: only integer scalar arrays can be converted to a scalar index
I'm a bit confused...
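I finally found a possible solution. The signature parameter is indeed the key, as suggested by D.L: ini and fin must be declared as scalars, (), so that each call receives one row of each array together with a single pair of indices: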
import numpy as np
ts1 = np.array([[1,2,3,4,5],[1,2,3,4,5],[1,2,3,4,5],[1,2,3,4,5],[1,2,3,4,5]])
ts2 = np.array([[6,2,7,4,5],[1,2,3,4,5],[3,2,3,8,5],[2,2,3,1,5],[9,8,7,6,5]])
ini = np.array([0,1,1,2,0])
fin = np.array([4,3,2,3,3])
def my_function(a,b,ini,fin):
    result = np.corrcoef(a[ini:fin],b[ini:fin])[0,1]
    return result
vfunc = np.vectorize(my_function,signature='(n),(n),(),()->()')
r = vfunc(ts1,ts2,ini,fin)
This way it works. One caveat: the execution speed is of the same order of magnitude as a list comprehension or the map function, so, at least in this case, numpy.vectorize is no faster. Any comments on this?
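On the speed question: the NumPy documentation describes np.vectorize as a convenience whose implementation is essentially a for loop, which would be consistent with timings close to a list comprehension. A loop-free alternative is to build a boolean mask for each row's window and compute the Pearson correlation with whole-array operations. The following is a minimal sketch, not a drop-in replacement for np.corrcoef: windowed_corr is a hypothetical helper name, and it assumes the same half-open window ini:fin as the code above.

```python
import numpy as np

ts1 = np.array([[1,2,3,4,5],[1,2,3,4,5],[1,2,3,4,5],[1,2,3,4,5],[1,2,3,4,5]])
ts2 = np.array([[6,2,7,4,5],[1,2,3,4,5],[3,2,3,8,5],[2,2,3,1,5],[9,8,7,6,5]])
ini = np.array([0,1,1,2,0])
fin = np.array([4,3,2,3,3])

def windowed_corr(a, b, ini, fin):
    """Row-wise Pearson correlation of a and b over columns ini[i]:fin[i] (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cols = np.arange(a.shape[1])
    # mask[i, j] is True when column j lies inside row i's half-open window.
    mask = (cols >= ini[:, None]) & (cols < fin[:, None])
    n = mask.sum(axis=1)
    # Windows with fewer than two elements produce 0/0; silence the warnings and keep nan.
    with np.errstate(invalid="ignore", divide="ignore"):
        a_mean = np.where(mask, a, 0.0).sum(axis=1) / n
        b_mean = np.where(mask, b, 0.0).sum(axis=1) / n
        # Deviations from the window mean, zeroed outside the window.
        da = np.where(mask, a - a_mean[:, None], 0.0)
        db = np.where(mask, b - b_mean[:, None], 0.0)
        num = (da * db).sum(axis=1)
        den = np.sqrt((da ** 2).sum(axis=1) * (db ** 2).sum(axis=1))
        return num / den

r_masked = windowed_corr(ts1, ts2, ini, fin)
print(r_masked)
```

Rows whose window contains a single element (rows 2 and 3 in this example) come out as nan, which matches what np.corrcoef returns for them. Whether this is actually faster would need to be measured on realistically sized arrays.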