Two 2D NumPy arrays are given (arr_all and arr_sub), where the second is a random subset of the first. I need to get the rows of the first one (arr_all) that are not included in the second one (arr_sub), based on an ID in one column that exists in both arrays. E.g.:
arr_all = array([[x, y, z, id_1],
                 [x, y, z, id_2],
                 [x, y, z, id_3],
                 [x, y, z, id_4],
                 [x, y, z, id_5]])
arr_sub = array([[x, y, z, id_1],
                 [x, y, z, id_2],
                 [x, y, z, id_5]])
Wanted output:
arr_remain = array([[x, y, z, id_3],
                    [x, y, z, id_4]])
A working solution would be:
list_remain = []
for i in range(len(arr_all)):
    # keep the row if its ID (column 3) does not occur in arr_sub
    if arr_all[i][3] not in arr_sub[:,3]:
        list_remain.append(arr_all[i])
arr_remain = np.array(list_remain)
However, this solution is only usable for a small dataset because of its horribly slow runtime: every iteration scans the whole ID column of arr_sub. Since my original dataset contains over 26 million rows, this is not sufficient.
I tried to adapt solutions like this, this or this, but I didn't manage to add the check for whether the ID exists in the other array's column.
Here's one way:
arr_remain = arr_all[~np.in1d(arr_all[:,-1], arr_sub[:,-1])]
# or, preferably on current NumPy (np.in1d is deprecated as of NumPy 2.0):
# arr_remain = arr_all[~np.isin(arr_all[:,-1], arr_sub[:,-1])]
array([['x', 'y', 'z', 'id_3'],
       ['x', 'y', 'z', 'id_4']], dtype='<U4')
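For completeness, here is a minimal runnable sketch of the same idea with numeric data. The sample values and the helper name in_sub are made up for illustration, and the sketch assumes the ID sits in the last column and is stored as whole numbers, so exact comparison is safe:

```python
import numpy as np

# Hypothetical sample data: columns are x, y, z, id
arr_all = np.array([
    [0.1, 0.2, 0.3, 1],
    [1.1, 1.2, 1.3, 2],
    [2.1, 2.2, 2.3, 3],
    [3.1, 3.2, 3.3, 4],
    [4.1, 4.2, 4.3, 5],
])

# Subset made of the rows with ids 1, 2 and 5
arr_sub = arr_all[[0, 1, 4]]

# Boolean mask: True where the row's id also appears in arr_sub
in_sub = np.isin(arr_all[:, -1], arr_sub[:, -1])

# Invert the mask to keep only the rows whose id is absent from arr_sub
arr_remain = arr_all[~in_sub]

print(arr_remain)  # the rows with ids 3 and 4
```

The mask is computed in one vectorized call instead of one Python-level membership test per row, which is where the loop version loses its time. If the IDs are known to be unique within each array, passing assume_unique=True to np.isin may speed things up further.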