I have a large dataset that looks like this:
The columns are: IP Start, IP End, Location. It is essentially referring to all IPs in the range between the start and end ip.
Looking to store this in a key-value store (Cloudflare Workers KV) however there is no query functionality. I can only do a get(key)
.
Is there a way I can structure this data (without a row for each single ip) where I can easily fetch the correct row for any given ip?
Some ideas:
The list()
method with limit: 1
would return the first item in a list range, which could make it easier to "scan" the table. list()
only returns keys, not values, but it also returns metadata; if your value fits in metadata, you can put in there and list()
will return it. Unfortunately, list()
does not let you specify a range, only a prefix, so it might be a bit tricky to formulate queries. Also, list()
operations are more expensive than get()
both in terms of billing and performance.
Another possible design might be to batch your data by subnet, e.g. store a key for each /24
, where the value contains all the ranges that are under that /24
(I'm using /24
as an example, but a larger size might make sense depending on the dataset). Then you only have to do one lookup for a particular IP address, looking up its surrounding subnet. As an additional optimization, if it's common to see IP ranges that are much larger than a single /24
, you could, for example, store those as /16
s, and then on every lookup you would need to check both the /24
and /16
keys. Or, perhaps there are few enough "large" ranges that you could embed them directly into your Worker script and avoid the need to do a lookup. As an added benefit, this approach would likely utilize the cache better, if you are looking up IPs of clients. Clients in the same subnet would tend to be in the same location and hit the same Cloudflare colo, so if they all map to the same key in your KV namespace, then the data would tend to be warm in cache at that colo.