Tags: arrays, numpy, loops, where-clause

Is there a fast way to get a subsection of an array based on a search criterion?


I have the following two NumPy arrays. The first is a single column containing the IDs of tens of millions of points. The second holds the points and their x, y coordinates in three columns: id, x, y. It also contains tens of millions of points.

import numpy as np

points = np.array([[id], [id], ..., [id]])  # array of points
coordinates = np.array([[id, x, y], ..., [id, x, y]]) # array of points and their coordinates

I would like to get all the rows of the coordinates array whose ID appears in the points array. In SQL, this would be written SELECT * FROM coordinates WHERE id IN points (with the result kept in the same order as in points).

I have the following constraints:

  • the IDs are not in the same order in the two arrays
  • I'm looking for a solution without any loop over indexes (to get a faster algorithm)

I've tried the following:


indexes = []
for i in range(len(points)):
    # position of the first row in coordinates whose ID equals points[i]
    indexes.append(np.where(coordinates[:, 0] == points[i])[0][0])
result = coordinates[indexes, :]

This works, but it loops over the indexes, which is exactly what I want to avoid.


Solution

  • Could you provide some context about how you are loading all this data into Python? I believe that if you store the data in a better format from the start, you could make such a query much faster.

    For example, you can store coordinates in a pandas.Series object, using the ids in coordinates as the index for the corresponding [x, y] values:

    import pandas as pd
    points = ...          # load these as you already do
    ids_xy = ...          # load the ids for the coordinates separately
    coordinates_xy = ...  # load the x, y coordinates separately (a Series needs 1-D data, e.g. a list of (x, y) pairs)
    coordinates = pd.Series(coordinates_xy, index=ids_xy)
    

    For a one-time conversion from your current format to this format (applicable if you are working in a notebook and have already loaded the data), you can do:

    coordinate_series = pd.Series([(x, y) for _, x, y in coordinates], index=coordinates[:, 0])
    

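    With this, you can look up all the coordinates you want with coordinates.loc[points] (or coordinate_series.loc[points] if you used the one-time conversion). If points is a single-column 2-D array as in the question, you will likely need to flatten it first, for example with points.ravel().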
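
To make the lookup concrete, here is a minimal sketch on made-up toy data. It shows the label-based pandas lookup described above, with a DataFrame swapped in for the Series of tuples (my substitution, not part of the answer), next to a NumPy-only alternative that sorts the IDs once and binary-searches them with np.searchsorted. Both variants assume that the IDs in coordinates are unique and that every requested ID is present. The names ids, xy, order and pos are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real arrays (values are made up for illustration).
points = np.array([[30], [10], [20]])            # IDs to look up, shape (N, 1)
coordinates = np.array([[10, 1.0, 1.5],
                        [20, 2.0, 2.5],
                        [30, 3.0, 3.5],
                        [40, 4.0, 4.5]])         # rows of [id, x, y]

# Flatten to 1-D and match the dtype of the ID column so label lookups line up.
ids = points.ravel().astype(coordinates.dtype)

# pandas variant: index the x, y columns by ID, then select in the order of `ids`.
# Assumption: a DataFrame is used here instead of a Series of (x, y) tuples.
xy = pd.DataFrame(coordinates[:, 1:], index=coordinates[:, 0], columns=["x", "y"])
result_pd = xy.loc[ids].to_numpy()

# NumPy-only variant: sort the ID column once, then binary-search each requested ID.
# Assumption: IDs are unique and every entry of `ids` exists in `coordinates`.
order = np.argsort(coordinates[:, 0])
pos = np.searchsorted(coordinates[:, 0], ids, sorter=order)
result_np = coordinates[order[pos]]              # full [id, x, y] rows, in `ids` order
```

Neither variant has a Python-level loop over indexes. If some requested IDs might be missing, the searchsorted variant needs an extra check (for example comparing result_np[:, 0] against ids) before the rows can be trusted.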