So, continuing from the discussion @TheBlackCat and I were having in this answer, I would like to know the best way to pass arguments to a Numpy vectorized function. The function in question is defined thus:
vect_dist_funct = np.vectorize(lambda p1, p2: vincenty(p1, p2).meters)
where vincenty comes from the Geopy package.
I currently call vect_dist_funct in this manner:
def pointer(point, centroid, tree_idx):
    intersect = list(tree_idx.intersection(point))
    if len(intersect) > 0:
        points = pd.Series([point]*len(intersect)).values
        polygons = centroid.loc[intersect].values
        dist = vect_dist_funct(points, polygons)
        return pd.Series(dist, index=intersect, name='Dist').sort_values()
    else:
        return pd.Series(np.nan, index=[0], name='Dist')
points['geometry'].apply(lambda x: pointer(point=x.coords[0], centroid=line['centroid'], tree_idx=tree_idx))
(Please refer to the question here: Labelled datatypes Python)
My question pertains to what happens inside the function pointer. The reason I am converting points to a pandas.Series and then getting the values (in the 4th line, just under the if statement) is to give it the same shape as polygons. If I merely define points either as points = [point]*len(intersect) or as points = itertools.repeat(point, len(intersect)), Numpy complains that it "cannot broadcast arrays of size (n,2) and size (n,) together" (n is the length of intersect).
If I call vect_dist_funct like so: dist = vect_dist_funct(itertools.repeat(points, len(intersect)), polygons), vincenty complains that I have passed it too many arguments. I am at a complete loss to understand the difference between the two.
Note that these are coordinates, and therefore will always be in pairs. Here are examples of what point and polygons look like:
point = (-104.950752, 39.854744)  # Passed directly to the function like this.
polygons = array([(-104.21750802451864, 37.84052458697633),
(-105.01017084789603, 39.82012158954065),
(-105.03965315742742, 40.669867471420886),
(-104.90353460825702, 39.837631505433706),
(-104.8650601872832, 39.870796282334744)], dtype=object)
# As returned by statement centroid.loc[intersect].values
What is the best way to call vect_dist_funct in this circumstance, so that I get a vectorized call and neither Numpy nor vincenty complains that I am passing the wrong arguments? Techniques that minimize memory consumption and increase speed are also sought. The goal is to compute the distance from the point to each polygon centroid.
np.vectorize doesn't really help you here. As per the documentation:
The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.
In fact, vectorize actively hurts you, since it converts the inputs into numpy arrays, doing an unnecessary and expensive type conversion and producing the errors you are seeing. You are much better off using a function with a for loop.
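To see where the broadcasting complaint comes from, here is a minimal sketch that uses only NumPy. The coordinates are shortened from the question, and the names coords and repeated are just for illustration. A repeated list of points becomes an (n, 2) array once NumPy converts it, while the object array of tuples stays (n,), and those two shapes do not broadcast:

```python
import numpy as np

point = (-104.950752, 39.854744)
coords = [(-104.2175, 37.8405), (-105.0102, 39.8201), (-105.0397, 40.6699)]

# Build a 1-D object array of tuples, similar to centroid.loc[intersect].values.
polygons = np.empty(len(coords), dtype=object)
for i, c in enumerate(coords):
    polygons[i] = c

# What NumPy makes of [point]*n when it converts it to an array.
repeated = np.asarray([point] * len(coords))

print(polygons.shape)  # (3,)
print(repeated.shape)  # (3, 2)

# Broadcasting compares trailing dimensions: 2 vs 3 do not match.
try:
    np.broadcast(repeated, polygons)
    broadcast_ok = True
except ValueError:
    broadcast_ok = False
```

Wrapping the points in a pandas.Series and taking .values sidesteps this only because it produces another (n,) object array of tuples.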
It is also better to use a function rather than a lambda for a top-level function, since it lets you have a docstring.
So this is how I would implement what you are doing:
def vect_dist_funct(p1, p2):
    """Apply `vincenty` to `p1` and each element of `p2`.

    Iterate over `p2`, returning `vincenty` with the first argument
    as `p1` and the second as the current element of `p2`. Returns
    a list where each value is the result (in meters) of the `vincenty`
    function call for the corresponding element of `p2`.
    """
    return [vincenty(p1, p2i).meters for p2i in p2]
If you really want to use vectorize, you can use the excluded argument to not vectorize the p1 argument, or better yet set up a lambda that wraps vincenty and only vectorizes the second argument:
def vect_dist_funct(p1, p2):
    """Apply `vincenty` to `p1` and each element of `p2`.

    Iterate over `p2`, returning `vincenty` with the first argument
    as `p1` and the second as the current element of `p2`. Returns
    a numpy array where each value is the result (in meters) of the
    `vincenty` function call for the corresponding element of `p2`.
    """
    vinc_p = lambda x: vincenty(p1, x).meters
    return np.vectorize(vinc_p)(p2)
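For completeness, a rough sketch of the excluded variant mentioned above. So that it runs without Geopy, it uses planar_dist, a made-up stand-in for vincenty(p1, p2).meters that just computes a flat Euclidean distance; swap in the real call in your own code. It also assumes p2 is a 1-D object array of tuples, as in the question:

```python
import math
import numpy as np

def planar_dist(p1, p2):
    """Hypothetical stand-in for vincenty(p1, p2).meters (flat Euclidean distance)."""
    return math.hypot(p1[0] - p2[0], p1[1] - p2[1])

point = (0.0, 0.0)
coords = [(3.0, 4.0), (6.0, 8.0), (0.0, 1.0)]

# 1-D object array of tuples, like the polygons array in the question.
polygons = np.empty(len(coords), dtype=object)
for i, c in enumerate(coords):
    polygons[i] = c

# excluded={0}: positional argument 0 (p1) is passed through as-is,
# so only p2 is iterated over.
vect = np.vectorize(planar_dist, excluded={0})
via_vectorize = vect(point, polygons)

# The plain loop gives the same distances without the array conversion.
via_loop = [planar_dist(point, p) for p in polygons]
```

Both give the same numbers; the only difference is that the vectorize version returns a numpy array while the loop returns a list.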