Let's say we have 10 millions of socks, and each of them has color sockcolor[i]
and is stored in drawer[i]
.
I'd like to count how many different colors there are in each of the 20 drawers.
To do this, I used a dictionary containing sets (sets use a hashtable under the hood which is handy to count unique elements).
The following code works, but it's quite slow (~10 seconds for 10 millions of socks).
How could we use Numpy techniques (vectorization?) or avoid the for
loop to speed up this computation?
import numpy as np
N = 10*1000*1000
drawer = np.random.randint(0, 20, N) # 20 drawers
sockcolor = np.random.randint(0, 100*1000, N) # 100k different colors
d = {}
for i, k in enumerate(drawer):
if k not in d: # can be simplified with collections.defaultdict but no gain here
d[k] = set()
d[k].add(sockcolor[i])
for k, s in d.items():
print(k, len(s))
Output:
16 99323
7 99330
0 99339
17 99364
14 99272
12 99303
11 99334
6 99362
19 99287
10 99282
4 99280
18 99325
3 99316
15 99303
5 99280
13 99357
2 99328
9 99290
1 99293
8 99293
You basically already have a mapping from drawers to sockcolors, but they are randomised and you want to organise them by drawer number.
The easiest thing to do is to first sort them by drawer numbers:
drawer_sort = np.argsort(drawer)
drawer = drawer[drawer_sort]
sockcolor = sockcolor[drawer_sort]
Now that they are sorted, there is no need to look for drawer number duplicates, you just have to find the indices at which the drawer numbers change, to form ranges, which are these:
changes, = np.where(drawer[1:]-drawer[:-1])
starts = np.concatenate([[0], changes+1])
ends = np.concatenate([changes, [len(drawer)]])
Now you can create your dictionary:
result = {drawer[start]: sockcolor[start:end] for start, end in zip(starts, ends)}
This way, the only itaration done in Python is the last line, which will be really fast if there are a small number of distinct drawer
values (in your case not more than 20).
The result can still have duplicate sockcolor
values, but that is easily solved in numpy:
result = {drawer: np.unique(sockcolors) for drawer, sockcolors in result.items()}