I'm analyzing large amounts of point cloud data created by a laser scanner. In the third step I remove points based on their z-value, but my function is really slow.
1. Import: The data is imported from a .csv file using pandas. The imported dataframe 'df' contains the data for X, Y, Z. Example: df has shape (300, 1001). Then X is the first third of df: X = df.iloc[:100, 1:], Y = df.iloc[100:200, 1:] and so on. The first column (index) is irrelevant. One row in X, Y, Z corresponds to the data of a single scan.
2. Convert to NumPy: The dataframe 'df' contains many empty fields ''. Therefore I change the data structure to a NumPy array 'A' of shape (N, 3) in which every row represents a single point. All points containing empty values are deleted.
3. Remove points based on the max. height of a scan: I'm only interested in the points slightly below the maximum of each scan. I use my function 'in_max_height' to create a mask of all points within the allowed range.
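The import and conversion steps above might be sketched as follows. This is only an illustration under assumptions: the helper name `to_points` and its `n_scans` parameter are hypothetical, the X, Y, Z blocks are assumed to be stacked vertically with `n_scans` rows each, and the tiny dataframe stands in for the real .csv import.

```python
import numpy as np
import pandas as pd

def to_points(df: pd.DataFrame, n_scans: int) -> np.ndarray:
    """Stack the X, Y, Z blocks of df into an (N, 3) array of complete points.

    Hypothetical helper: assumes rows [0, n) hold X, [n, 2n) hold Y and
    [2n, 3n) hold Z, with the first column being an irrelevant index.
    """
    # Skip the index column and turn '' (or any non-numeric cell) into NaN.
    values = (
        df.iloc[:, 1:]
        .apply(pd.to_numeric, errors="coerce")
        .to_numpy(dtype=float)
    )
    # Split into the three coordinate blocks, one row per scan.
    X, Y, Z = (values[i * n_scans:(i + 1) * n_scans] for i in range(3))
    # One row per point, then drop every point with a missing coordinate.
    A = np.column_stack((X.ravel(), Y.ravel(), Z.ravel()))
    return A[~np.isnan(A).any(axis=1)]

# Tiny stand-in for the imported dataframe: 2 scans, 2 points per scan.
df = pd.DataFrame([
    [0, 1.0, 1.0],   # X, scan 0
    [1, 2.0, 2.0],   # X, scan 1
    [0, 0.0, 0.5],   # Y, scan 0
    [1, 0.0, ""],    # Y, scan 1 (one empty field)
    [0, 3.0, 3.5],   # Z, scan 0
    [1, 4.0, 4.5],   # Z, scan 1
])
A = to_points(df, n_scans=2)
```

Slicing with `df.iloc[:, 1:]` is what leaves the index column out, so the same slice could also be applied before any filtering done directly on the dataframe.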
Here's my code:
def in_max_height(A, hMax):
    # get unique x values
    unique_x = np.unique(A[:, 0])
    # create an empty mask array with the same shape as A[:, 2]
    mask = np.zeros_like(A[:, 2], dtype=bool)
    # iterate over unique x and find the max. z-value
    for x in unique_x:
        zMax = np.max(A[A[:, 0] == x, 2])
        mask[A[:, 0] == x] = ~(A[A[:, 0] == x, 2] < zMax - hMax)
    return mask

A = A[in_max_height(A, hMax=1)]  # apply max. layer height
I tried to remove the low points after step 1 but I couldn't figure out how to ignore the index column of the dataframe.
Right now with an average point cloud consisting of about 375,000 points my function takes about 11 s to finish. I would like to learn how to fundamentally tackle these big data problems.
I admit that my code is not optimal, but it runs in well under 11 s on my laptop:
import random
import numpy as np
import time
def get_random_point():
    i = 1950
    # randint needs integer bounds, so use floor division for the z range
    return (random.randint(0, i), random.randint(0, i), random.randint(0, i // 10))
# Construct test array with 375000 points and 1950 unique values
test_array = np.array([get_random_point() for x in range(375000)],dtype=np.int64)
print(test_array.shape)
(375000, 3)
start = time.time()
# Sort by the first column, then by the last column, both in decreasing order
tsorted = test_array[np.lexsort((test_array[:,2], test_array[:,0]))][::-1]
res = []
u = tsorted[0][0]
z_max = tsorted[0][2]
hmax = 1
for x in tsorted:
    if x[0] != u or not res:
        # first (highest) point of a new x group
        u = x[0]
        z_max = x[2]
        res.append(x)
    else:
        if x[2] + hmax >= z_max:
            res.append(x)
res = np.array(res)
print(time.time() - start)
# in secs
0.47696924209594727
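As a further option, the loop over unique x values could probably be replaced by array operations altogether. The following is a minimal sketch rather than a benchmarked result: `in_max_height_vectorized` is a hypothetical name, and it assumes A is the (N, 3) array from the question with no NaN values left in it.

```python
import numpy as np

def in_max_height_vectorized(A, hMax):
    """Boolean mask of points within hMax of the highest point sharing their x."""
    # inverse[i] is the index of point i's x value among the sorted unique x.
    unique_x, inverse = np.unique(A[:, 0], return_inverse=True)
    inverse = inverse.ravel()
    # Per-group maximum z: unbuffered in-place maximum over all points.
    z_max = np.full(unique_x.shape, -np.inf)
    np.maximum.at(z_max, inverse, A[:, 2])
    # Broadcast each group's maximum back to its points and compare.
    return A[:, 2] >= z_max[inverse] - hMax

# Small example: two x groups with maxima 5.0 and 10.0, hMax = 1.
A = np.array([
    [0.0, 0.0, 5.0],
    [0.0, 1.0, 4.5],
    [0.0, 2.0, 3.0],
    [1.0, 0.0, 10.0],
    [1.0, 1.0, 8.9],
])
mask = in_max_height_vectorized(A, hMax=1)
filtered = A[mask]
```

The mask is meant to match the one from `in_max_height` in the question, since `~(z < zMax - hMax)` and `z >= zMax - hMax` agree for non-NaN values. The difference is that the grouping is done once with `np.unique(..., return_inverse=True)` instead of recomputing `A[:, 0] == x` for every unique x.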