Tags: python, numpy, numpy-ndarray, point-clouds

NumPy's np.unique takes a lot of time and memory on a very large array. Is there an efficient alternative?


I am using NumPy to downsample a point cloud file. As part of this process I call np.unique to get the unique values and their counts from the array. The array is very large, with around 36 million 3-D points, and the np.unique call alone currently takes around 300 seconds. Is there a more efficient alternative, or another data structure I could switch to that does exactly what np.unique does here, but faster? I understand that np.unique is already optimized, but perhaps a different approach would give a better result. The array just contains points in x,y,z format.

I am attaching a snippet of my code below. Thanks in advance for the help, and please feel free to ask for any other information you might need. I tried changing the precision level, with no effect. I am using NumPy 1.24.4 on Ubuntu 22.04.

import numpy as np
import time

points = np.random.random(size=(38867362, 3)) * 10000

# print(points)
# print("points bytes:", points.nbytes)

start_time = time.time()
# Shift to the origin, bucket every coordinate into 3-unit voxels, then
# count the unique voxels and map each point back to its voxel.
voxels = ((points - np.min(points, axis=0)) // 3).astype(int)
unique_points, inverse, counts = np.unique(voxels, axis=0,
                                           return_inverse=True,
                                           return_counts=True)
print("::INFO:: Total time taken: ", time.time() - start_time)

...and so on.
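
For context, inverse and counts are then used to average the points that end up in the same voxel, roughly like this (simplified sketch, not the exact code):

# Continues from the snippet above (reuses points, unique_points,
# inverse and counts): average the original points per voxel.
voxel_sums = np.zeros(unique_points.shape, dtype=np.float64)
np.add.at(voxel_sums, inverse, points)        # unbuffered per-voxel sums
downsampled = voxel_sums / counts[:, None]    # one mean point per voxel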


Solution

  • According to the OP's ultimate goal, this is a solution for 1-D data (it is easy to adjust; it was simply unclear whether each dimension should be treated separately or all at once). A vectorized sketch for the 3-D case follows in the next bullet.

    from collections import defaultdict
    from tqdm import tqdm  # optional, only for a progress bar
    import numpy as np
    
    floats = np.random.random(size=388) * 100
    ints = floats.astype(np.int32)  # integer bin of every float value
    
    # Group the original float values by their integer bin.
    D = defaultdict(list)
    
    for f, i in tqdm(zip(floats, ints), total=len(floats)):
        D[i].append(f)
    
    # One representative value (the mean) per bin.
    means = []
    for key, vals in D.items():
        means.append(np.mean(vals))
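
  • One way to adjust the same grouping idea to the 3-D case without a Python loop is to fold the integer voxel coordinates of each point into a single integer key and run np.unique on that 1-D key array, which avoids the much slower axis=0 path. A minimal sketch, assuming a 64-bit platform, the voxel size from the question, and that the total number of voxels fits into an int64 key (it comfortably does for the sizes in the question); the names voxel_size, vox, keys and centroids are only for illustration:

    import numpy as np
    
    points = np.random.random(size=(38_867_362, 3)) * 10000
    voxel_size = 3
    
    # Integer voxel coordinate of every point, exactly as in the question.
    vox = ((points - points.min(axis=0)) // voxel_size).astype(np.int64)
    
    # Fold (x, y, z) into one integer key per point so that np.unique can
    # run on 1-D data instead of the slower axis=0 path.
    dims = vox.max(axis=0) + 1
    keys = np.ravel_multi_index((vox[:, 0], vox[:, 1], vox[:, 2]), dims)
    
    unique_keys, inverse, counts = np.unique(keys, return_inverse=True,
                                             return_counts=True)
    
    # Per-voxel centroids: sum each coordinate column per key, divide by count.
    centroids = np.column_stack([
        np.bincount(inverse, weights=points[:, i]) / counts
        for i in range(3)
    ])

    Running np.unique on a 1-D integer array is usually considerably faster than np.unique(..., axis=0), which has to view each row as a single structured element before sorting.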