Tags: postgresql, indexing, ip-address

How to optimize IP Address search query


I am trying to optimize IP address geolocation queries on the following raw schema (the dataset is from an https://ipinfo.io/ free trial):

CREATE TABLE raw_geolocations (
    start_ip INET NOT NULL,
    end_ip INET NOT NULL,
    join_key CIDR NOT NULL,
    city TEXT NOT NULL,
    region TEXT,
    country TEXT NOT NULL,
    lat NUMERIC NOT NULL,
    lng NUMERIC NOT NULL,
    postal TEXT,
    timezone TEXT NOT NULL
);

Just querying with:

select *
from unnest(array[
    inet '<ip_address_string>'
]) ip_address
left join raw_geolocations g on ip_address between start_ip and end_ip;

was terribly slow (between 50s and 120s for a single IP address).

So I followed a "guide" here and migrated raw_geolocations to a new table called ip_geolocations, with a GiST index created on the ip_segment field:

create type iprange as range (subtype=inet);

create table ip_geolocations(
    id bigserial primary key not null,
    ip_segment iprange not null,
    join_key cidr not null,
    city TEXT NOT NULL,
    region TEXT,
    country TEXT NOT NULL,
    lat NUMERIC NOT NULL,
    lng NUMERIC NOT NULL,
    postal TEXT,
    timezone TEXT NOT NULL
);

insert into ip_geolocations(ip_segment, join_key, city, region, country, lat, lng, postal, timezone)
select iprange(start_ip, end_ip, '[]'), join_key, city, region, country, lat, lng, postal, timezone
from raw_geolocations;

create index gist_idx_ip_geolocations_ip_segment on ip_geolocations USING gist (ip_segment);
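(As a quick sanity check that the custom range type behaves as expected: the constructor function is created automatically by CREATE TYPE ... AS RANGE, and containment comes from the generic range operator.)

```sql
-- iprange(lower, upper, bounds) is generated automatically by
-- CREATE TYPE iprange AS RANGE; containment works via the generic
-- anyrange @> anyelement operator.
SELECT iprange(inet '10.0.0.0', inet '10.0.0.255', '[]') @> inet '10.0.0.7';
-- expected: true
```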

This was great for single-IP-address queries, bringing execution time down to between 20ms and 200ms; however, bulk queries were still fairly slow, taking around 1.5s to 3s for 100 IP addresses at once.

Sample query:

select *
from unnest(array[
    inet '<ip_address_string>'
]) ip_address
left join ip_geolocations g on g.ip_segment @> ip_address;

A few problems came up as I dove into other optimizations:


I then decided to look into GIN indexes, but ran into this Postgres error when creating one:

create index test_idx_geolocations on ip_geolocations using gin (ip_segment);

ERROR:  operator class "inet_ops" does not exist for access method "gin"
Time: 2.173 ms

I am struggling to understand the instructions for how to use inet_ops as an operator class for either the iprange type I defined, or even for the plain inet and cidr types. Additionally, I tried to define my own operator class, to no avail:

create operator class gin_iprange_ops default for type iprange
using gin as
    OPERATOR        7       @> (iprange, inet),
    OPERATOR        8       <@ (inet, iprange);

ERROR:  operator does not exist: iprange @> inet

This error doesn't make much sense to me since the sample query above works with the @> operator.
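(My guess at what is happening, which can be checked in the system catalog: the containment operator the sample query resolves to is the polymorphic anyrange @> anyelement, so there is no concrete iprange @> inet entry for CREATE OPERATOR CLASS to find.)

```sql
-- List every @> operator with its argument types; the range-containment
-- one is registered on the polymorphic (anyrange, anyelement) pair,
-- not on concrete (iprange, inet) types.
SELECT oprname,
       oprleft::regtype  AS left_type,
       oprright::regtype AS right_type
FROM pg_operator
WHERE oprname = '@>';
```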


While blocked on that, I decided to see whether tweaking Postgres configuration would improve query times, so I made the following changes:

alter table ip_geolocations set (parallel_workers = 4);

set max_parallel_maintenance_workers to 4;

set maintenance_work_mem to '1 GB';

This improved index creation time (though not linearly with the worker count), but had no noticeable effect on query execution time.


Lastly, there are other ip address tables of a similar form:

CREATE TABLE raw_<other_table> (
    start_ip INET NOT NULL,
    end_ip INET NOT NULL,
    join_key CIDR NOT NULL,
    ... <other_fields>
);

that I want to be able to join on. I followed a similar set of steps for each of these other tables (3 of them) and migrated them to ip_<other_table> tables with iprange fields and GiST indexes.

Unfortunately, joining these tables on the join_key was still incredibly slow.
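(For reference, since these are plain equality joins, ordinary B-tree indexes on the join keys are what the planner would normally use; GiST is not required for `=`. A sketch, using the table names from above:)

```sql
-- Plain B-tree indexes on the cidr join keys. A B-tree supports the
-- = operator directly and also gives the planner pre-sorted input
-- for merge joins, avoiding the external-merge sorts.
CREATE INDEX idx_ip_geolocations_join_key       ON ip_geolocations (join_key);
CREATE INDEX idx_ip_companies_join_key          ON ip_companies (join_key);
CREATE INDEX idx_ip_carriers_join_key           ON ip_carriers (join_key);
CREATE INDEX idx_ip_privacy_detections_join_key ON ip_privacy_detections (join_key);
```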

Any help is much appreciated. I am very new to the world of IP address queries (range-containment queries), custom operator classes, and index types (GiST and GIN), so any pointers or documentation that might help me would be great.


Edits

  1. Query plan where I just check that the ip_address falls within the ip_segment (iprange) column:
explain analyze select *
from unnest(array[
    inet '98.237.137.99'
]) ip_address
left join ip_geolocations g on g.ip_segment @> ip_address
left join ip_companies c on c.ip_segment @> ip_address
left join ip_carriers cr on cr.ip_segment @> ip_address
left join ip_privacy_detections pd on pd.ip_segment @> ip_address;
                                                                                     QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=1930.48..7972624532924773.00 rows=931561551811655 width=412) (actual time=562.324..589.083 rows=1 loops=1)
   ->  Nested Loop Left Join  (cost=23.78..1160195946065.69 rows=116337334725 width=356) (actual time=287.962..314.719 rows=1 loops=1)
         ->  Nested Loop Left Join  (cost=0.84..30592828311.99 rows=1694915933 width=293) (actual time=282.013..308.768 rows=1 loops=1)
               ->  Nested Loop Left Join  (cost=0.42..515233.65 rows=27965 width=179) (actual time=270.532..297.172 rows=1 loops=1)
                     ->  Function Scan on unnest ip_address  (cost=0.00..0.01 rows=1 width=32) (actual time=0.006..0.008 rows=1 loops=1)
                     ->  Index Scan using idx_ip_companies_ip_segment on ip_companies c  (cost=0.42..513835.38 rows=139826 width=147) (actual time=270.519..297.157 rows=1 loops=1)
                           Index Cond: (ip_segment @> ip_address.ip_address)
               ->  Index Scan using idx_ip_geolocations_ip_segment on ip_geolocations g  (cost=0.42..1090919.64 rows=303041 width=114) (actual time=11.474..11.588 rows=1 loops=1)
                     Index Cond: (ip_segment @> ip_address.ip_address)
         ->  Bitmap Heap Scan on ip_carriers cr  (cost=22.94..663.04 rows=343 width=63) (actual time=5.939..5.939 rows=0 loops=1)
               Recheck Cond: (ip_segment @> ip_address.ip_address)
               ->  Bitmap Index Scan on test_idx_cr  (cost=0.00..22.85 rows=343 width=0) (actual time=5.935..5.935 rows=0 loops=1)
                     Index Cond: (ip_segment @> ip_address.ip_address)
   ->  Bitmap Heap Scan on ip_privacy_detections pd  (cost=1906.70..68119.89 rows=40037 width=56) (actual time=274.352..274.353 rows=0 loops=1)
         Recheck Cond: (ip_segment @> ip_address.ip_address)
         ->  Bitmap Index Scan on idx_ip_privacy_detections_ip_segment  (cost=0.00..1896.69 rows=40037 width=0) (actual time=274.349..274.350 rows=0 loops=1)
               Index Cond: (ip_segment @> ip_address.ip_address)
 Planning Time: 0.739 ms
 Execution Time: 589.823 ms
(19 rows)

Note that for each of the 4 tables being queried, I built a GiST index on its ip_segment field.

Each of these left joins could honestly be a separate query, with the results associated in the application layer, but in an ideal world RDS handles the associations (joins) for me.

Running the above query for just 1 IP address takes 26.376 ms. For around 100 random IP addresses it takes around 11309 ms (11 seconds), improving to around 1.8 seconds after running the same query a few times.

  2. Query plan where I use the provided cidr join_key field for table joins:
explain analyze select *
from unnest(array[
    inet '98.237.137.99'
]) ip_address
left join ip_geolocations g on g.ip_segment @> ip_address
left join ip_companies c on c.join_key = g.join_key
left join ip_carriers cr on cr.join_key = g.join_key
left join ip_privacy_detections pd on pd.join_key = g.join_key;
                                                                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Right Join  (cost=13413427.68..762295480.26 rows=49873343109 width=412) (actual time=53725.909..53726.119 rows=121 loops=1)
   Merge Cond: ((c.join_key)::inet = (g.join_key)::inet)
   ->  Sort  (cost=10678179.38..10748092.42 rows=27965218 width=147) (actual time=45444.278..46291.865 rows=4939342 loops=1)
         Sort Key: c.join_key
         Sort Method: external merge  Disk: 4415120kB
         ->  Seq Scan on ip_companies c  (cost=0.00..910715.18 rows=27965218 width=147) (actual time=0.014..4597.313 rows=27965218 loops=1)
   ->  Materialize  (cost=2735248.30..3478041.50 rows=41149314 width=265) (actual time=6963.430..6963.598 rows=121 loops=1)
         ->  Merge Left Join  (cost=2735248.30..3375168.21 rows=41149314 width=265) (actual time=6963.426..6963.502 rows=121 loops=1)
               Merge Cond: ((g.join_key)::inet = (pd.join_key)::inet)
               ->  Merge Left Join  (cost=1112932.44..1115281.67 rows=124973 width=209) (actual time=80.721..80.725 rows=1 loops=1)
                     Merge Cond: ((g.join_key)::inet = (cr.join_key)::inet)
                     ->  Sort  (cost=1103325.02..1103476.54 rows=60608 width=146) (actual time=28.093..28.094 rows=1 loops=1)
                           Sort Key: g.join_key
                           Sort Method: quicksort  Memory: 25kB
                           ->  Nested Loop Left Join  (cost=0.42..1093950.06 rows=60608 width=146) (actual time=27.717..28.086 rows=1 loops=1)
                                 ->  Function Scan on unnest ip_address  (cost=0.00..0.01 rows=1 width=32) (actual time=0.016..0.017 rows=1 loops=1)
                                 ->  Index Scan using idx_ip_geolocations_ip_segment on ip_geolocations g  (cost=0.42..1090919.64 rows=303041 width=114) (actual time=27.695..28.062 rows=1 loops=1)
                                       Index Cond: (ip_segment @> ip_address.ip_address)
                     ->  Materialize  (cost=9607.42..9950.61 rows=68639 width=63) (actual time=44.198..50.257 rows=20609 loops=1)
                           ->  Sort  (cost=9607.42..9779.01 rows=68639 width=63) (actual time=44.195..47.308 rows=20609 loops=1)
                                 Sort Key: cr.join_key
                                 Sort Method: external merge  Disk: 5120kB
                                 ->  Seq Scan on ip_carriers cr  (cost=0.00..1510.39 rows=68639 width=63) (actual time=0.486..12.759 rows=68639 loops=1)
               ->  Materialize  (cost=1622315.86..1662352.94 rows=8007417 width=56) (actual time=5143.369..6417.708 rows=4015488 loops=1)
                     ->  Sort  (cost=1622315.86..1642334.40 rows=8007417 width=56) (actual time=5143.366..5843.105 rows=4015488 loops=1)
                           Sort Key: pd.join_key
                           Sort Method: external merge  Disk: 439120kB
                           ->  Seq Scan on ip_privacy_detections pd  (cost=0.00..156763.17 rows=8007417 width=56) (actual time=0.802..845.180 rows=8007417 loops=1)
 Planning Time: 12.618 ms
 Execution Time: 54800.482 ms
(30 rows)

Note that for each of the 4 tables being queried, I built a composite GiST index on (ip_segment range_ops, join_key inet_ops).

This query took around 59 seconds to execute for even a single IP address, which is abysmal. I wonder if it has to do with using range_ops instead of inet_ops for the ip_segment part of the GiST index.

What confuses me is why the gist indexes I built on join_key and ip_segment weren't even used for the 3 tables that were not ip_geolocations.

  3. One thing to note is that I am comparing this against ipinfo's actual API, which for the same set of IP addresses comes in at < 650 ms per full request (100 IP addresses per batch), meaning there is a lot of room for improvement.

Solution

  • It turns out the best approach was to go back to plain INET columns and create a BTREE index on just the start_ip field. The query then becomes a simple <= or BETWEEN check.
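(A sketch of that approach against the original raw schema. The ORDER BY ... DESC LIMIT 1 trick finds the closest range start at or below the address, and the end_ip check in the join condition rejects addresses that fall in a gap between ranges.)

```sql
-- B-tree index on the range start; INET has a total order, so a
-- plain B-tree works.
CREATE INDEX btree_idx_raw_geolocations_start_ip
    ON raw_geolocations (start_ip);

-- Per-address lookup: take the row with the greatest start_ip <= address,
-- then verify the address does not lie past end_ip.
SELECT ip_address, g.*
FROM unnest(array[inet '98.237.137.99']) AS ip_address
LEFT JOIN LATERAL (
    SELECT *
    FROM raw_geolocations r
    WHERE r.start_ip <= ip_address
    ORDER BY r.start_ip DESC
    LIMIT 1
) g ON ip_address <= g.end_ip;
```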