Search code examples
postgresqlmultidimensional-arrayindexingcube

postgresql index 100 dimensional and 25 million rows table


My task is to quickly find the nearest neighbor in 100 dimensional space. So I create a test table:

create extension cube;
create table vectors (id serial, vector cube);
insert into vectors select id, cube(ARRAY[round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000)]) from generate_series(1, 25000000) id;

Search request:

explain analyze SELECT * FROM vectors ORDER BY vector <-> '(705, 501, 321, 345, 591, 58, 229, 420, 341, 628, 84, 476, 700, 71, 815, 616, 45, 686, 886, 102, 378, 172, 263, 538, 665, 553, 475, 845, 540, 963, 893, 209, 479, 357, 914, 70, 415, 142, 490, 756, 770, 574, 232, 470, 645, 47, 86, 690, 733, 972, 792, 112, 144, 55, 650, 810, 608, 125, 655, 148, 88, 548, 357, 567, 905, 271, 637, 320, 413, 128, 76, 183, 702, 308, 653, 347, 355, 739, 37, 88, 711, 829, 200, 856, 884, 850, 665, 493, 975, 320, 641, 63, 869, 998, 630, 774, 269, 268, 94, 682)'::cube LIMIT 10;

Without an index, the request to find the nearest neighbor takes about 30 seconds.

Now we will create an index:

CREATE INDEX vectors_vector_idx ON vectors USING GIST (vector);

Repeat search request:

explain analyze SELECT * FROM vectors ORDER BY vector <-> '(705, 501, 321, 345, 591, 58, 229, 420, 341, 628, 84, 476, 700, 71, 815, 616, 45, 686, 886, 102, 378, 172, 263, 538, 665, 553, 475, 845, 540, 963, 893, 209, 479, 357, 914, 70, 415, 142, 490, 756, 770, 574, 232, 470, 645, 47, 86, 690, 733, 972, 792, 112, 144, 55, 650, 810, 608, 125, 655, 148, 88, 548, 357, 567, 905, 271, 637, 320, 413, 128, 76, 183, 702, 308, 653, 347, 355, 739, 37, 88, 711, 829, 200, 856, 884, 850, 665, 493, 975, 320, 641, 63, 869, 998, 630, 774, 269, 268, 94, 682)'::cube LIMIT 10;
Limit  (cost=0.55..55.59 rows=10 width=820) (actual time=894342.029..1454440.760 rows=10 loops=1)
->  Index Scan using vectors_vector_idx0 on vectors  (cost=0.55..137606356.86 rows=24999816 width=820) (actual time=894342.027..1454440.754 rows=10 loops=1)
     Order By: (vector <-> '(705, 501, 321, 345, 591, 58, 229, 420, 341, 628, 84, 476, 700, 71, 815, 616, 45, 686, 886, 102, 378, 172, 263, 538, 665, 553, 475, 845, 540, 963, 893, 209, 479, 357, 914, 70, 415, 142, 490, 756, 770, 574, 232, 470, 645, 47, 86, 690, 733, 972, 792, 112, 144, 55, 650, 810, 608, 125, 655, 148, 88, 548, 357, 567, 905, 271, 637, 320, 413, 128, 76, 183, 702, 308, 653, 347, 355, 739, 37, 88, 711, 829, 200, 856, 884, 850, 665, 493, 975, 320, 641, 63, 869, 998, 630, 774, 269, 268, 94, 682)'::cube)
 Planning time: 0.131 ms
 Execution time: 1454440.849 ms
(5 rows)

Now the query is executed about 20 minutes. How can I speed up the search with indexing?


Solution

  • The problem was associated with a small amount of RAM(64 GB) for this task. It looks like the table is fully load into RAM and then there is a search. With indexes the table weighs 100 GB.