I have the two following NumPy arrays. The first one is a single column containing the IDs of tens of millions of points. The second one represents the points and their coordinates x, y. It has three columns: id, x, y. It also contains tens of millions of points.
import numpy as np
points = np.array([[id], [id], ..., [id]])  # array of point IDs
coordinates = np.array([[id, x, y], ..., [id, x, y]])  # array of points and their coordinates
I would like to get all the rows of the array coordinates whose ID appears in the array points. In SQL, it would be written SELECT * FROM coordinates WHERE id IN points (and the result must stay ordered the same way as the array points).
My main constraint is performance: with tens of millions of rows on each side, I need to avoid explicit Python loops.
I've tried the following:
indexes = []
for i in range(len(points)):
    indexes.append(np.where(coordinates[:, 0] == points[i])[0][0])
result = coordinates[indexes, :]
This works, but it loops over every point in Python, which is exactly what I want to avoid.
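To make the goal concrete, here is a loop-free sketch using np.searchsorted on toy data (the ids below are invented; like the loop above, it assumes every queried id actually occurs in coordinates):

```python
import numpy as np

# Toy stand-ins for the real arrays (ids invented).
points = np.array([[30], [10], [20]])              # desired id order
coordinates = np.array([[10, 1.0, 1.5],
                        [20, 2.0, 2.5],
                        [30, 3.0, 3.5]])

# Sort coordinates by id once, then binary-search each queried id.
order = np.argsort(coordinates[:, 0])
sorted_ids = coordinates[order, 0]
pos = np.searchsorted(sorted_ids, points.ravel())
result = coordinates[order[pos]]                   # rows in the order of `points`
```

The sort is O(n log n) once, and each lookup is a binary search, so no per-point Python loop is needed.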
Could you provide some context about how you are loading all this data in Python? I believe that if you store the data in a better format from the get-go, such a query could be made much faster.
For example, you can store coordinates in a pandas.Series object, where the ids in coordinates
serve as the index for the corresponding (x, y) values:
import pandas as pd
points = # You can load these as you already have
ids_xy = # Load ids for the coordinates separately
coordinates_xy = # Load the x, y coordinates separately
# A 2-D array can't go directly into a Series, so store one (x, y) tuple per id
coordinates = pd.Series(list(map(tuple, coordinates_xy)), index=ids_xy)
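For illustration, with made-up toy arrays this construction would look like:

```python
import numpy as np
import pandas as pd

ids_xy = np.array([10, 20, 30])                    # toy ids (invented)
coordinates_xy = np.array([[1.0, 1.5],
                           [2.0, 2.5],
                           [3.0, 3.5]])

# One (x, y) tuple per id; the id becomes the index label.
coordinates = pd.Series(list(map(tuple, coordinates_xy)), index=ids_xy)
subset = coordinates.loc[[30, 10]]                 # rows come back in query order
```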
For a one-time conversion from your current format to this format (applicable if you are running some sort of ipynb and have already loaded the data), you can do:
coordinate_series = pd.Series([(x, y) for _, x, y in coordinates], index=coordinates[:, 0])
With this, you can fetch all the coordinates you want, in the order given by points, with coordinate_series.loc[points.ravel()] (ravel flattens your single-column points array into the 1-D list of labels that .loc expects).
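Putting it together on toy data (ids invented), the order-preserving query would look like:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real arrays (ids invented).
points = np.array([[30], [10], [20]])
coordinates = np.array([[10, 1.0, 1.5],
                        [20, 2.0, 2.5],
                        [30, 3.0, 3.5]])

# One-time conversion: Series of (x, y) tuples indexed by id.
coordinate_series = pd.Series([(x, y) for _, x, y in coordinates],
                              index=coordinates[:, 0])

# Label-based lookup; result rows follow the order of `points`.
result = coordinate_series.loc[points.ravel()]
```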