Given two tables I'd like to conduct a lookup over all ips and find the network it belongs to:
I have two large tables:
and the following networks:
Regarding the ClientIP (First table) I thought of casting the whole column with ip_address
Regarding the second column (second table) I thought of casting the whole column with ip_network
Something like this:
import ipaddress
network = ipaddress.ip_network('99.96.0.0/13')
ip_obj = ipaddress.ip_address('99.87.29.96')
print(ip_obj in network)
and then conduct an apply function, but it is very slow especially for tables with this kind of size.
I noticed in some databases like KQL, there is a built-in support: ipv4-lookup
Is there any kind of builtin support for iplookup in polars? or in pyarrows? any suggestions?
Assuming the network dataframe's IpCidr
blocks are not overlapping, you could convert the IPv4 addresses to a pl.Int64
and get the max value within the CIDR block.
A function using only a pl.Expr
to convert an IPv4 address to pl.Int64
import polars as pl
def ip_addr4_int64_expr(ipv4_str_expr: pl.Expr):
return (
ipv4_str_expr.str.split(".")
.list.eval(
pl.element().cast(pl.Int64)
* (2 ** (8 * (pl.element().cum_count(reverse=True)))).cast(pl.Int64)
)
.list.sum()
)
A range of addresses can be derived from the CIDR's prefix by getting the number of available hosts and adding it to the base IPv4's Int64 representation.
cidr_split_ipv4_expr = pl.col("IpCidr").str.split("/").list.get(0)
cidr_prefix_expr = pl.col("IpCidr").str.split("/").list.get(1).cast(pl.Int64)
ip_cidr_df = ip_cidr_df.with_columns(
ip_addr4_int64_expr(cidr_split_ipv4_expr).alias("ip_addr4_int64"),
(
ip_addr4_int64_expr(cidr_split_ipv4_expr)
- 1
+ ((2 ** (32 - cidr_prefix_expr)).cast(pl.Int64))
).alias("cidr_ip_max"),
)
client_df = client_df.with_columns(
ip_addr4_int64_expr(pl.col("ClientIP")).alias("ip_addr4_int64"),
)
Using a join_asof
, a range lookup can be done. Then null out values that return above the max IP range.
client_df = (
client_df.sort("ip_addr4_int64")
.join_asof(ip_cidr_df.sort("ip_addr4_int64"), on="ip_addr4_int64")
.select(
"ClientIP",
"Timestamp",
pl.when(pl.col("ip_addr4_int64") <= pl.col("cidr_ip_max"))
.then(pl.col("Info"))
.alias("Info"),
)
)
Examples:
ip_cidr_df = pl.DataFrame(
{
"IpCidr": [
"99.96.0.0/13", "99.88.0.0/13", "1.0.136.0/22", "1.0.128.0/21",
"1.0.0.0/24", "10.0.0.0/8", "127.0.0.0/8", "172.16.0.0/12",
"192.168.0.0/16",
],
"Info": [
"ATT-INTERNET4", "ATT-INTERNET4", "TOT-NET TOT Public Company Limit",
"TOT-NET TOT Public Company Limit", "CLOUDFLARENET", "The 10.0.0.0/8 Range",
"The 127.0.0.0/8 Range", "The 172.16.0.0/12 Range", "The 192.168.0.0/16 Range",
],
}
)
client_df = pl.DataFrame(
{
"Timestamp": [
"2023-06-01 00:00:00", "2023-06-01 00:00:00", "2023-06-01 00:00:00",
"2023-06-01 00:00:00", "2023-06-30 23:59:00", "2023-06-30 23:59:00",
"2023-06-30 23:59:00",
],
"ClientIP": [
"1.0.0.14", "99.96.1.5", "99.87.29.96", "10.0.0.1", "127.0.0.1", "172.16.0.1", "192.168.0.1",
],
}
)
Output:
shape: (7, 3)
┌─────────────┬─────────────────────┬──────────────────────────┐
│ ClientIP ┆ Timestamp ┆ Info │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════════════╪═════════════════════╪══════════════════════════╡
│ 1.0.0.14 ┆ 2023-06-01 00:00:00 ┆ CLOUDFLARENET │
│ 10.0.0.1 ┆ 2023-06-01 00:00:00 ┆ The 10.0.0.0/8 Range │
│ 99.87.29.96 ┆ 2023-06-01 00:00:00 ┆ null │
│ 99.96.1.5 ┆ 2023-06-01 00:00:00 ┆ ATT-INTERNET4 │
│ 127.0.0.1 ┆ 2023-06-30 23:59:00 ┆ The 127.0.0.0/8 Range │
│ 172.16.0.1 ┆ 2023-06-30 23:59:00 ┆ The 172.16.0.0/12 Range │
│ 192.168.0.1 ┆ 2023-06-30 23:59:00 ┆ The 192.168.0.0/16 Range │
└─────────────┴─────────────────────┴──────────────────────────┘
Note: This answer assumes a dataframe consisting only of IPv4 addresses and no overlapping CIDR blocks in ip_cidr_df
.
The same logic could be applied by converting IPv6 addresses to a pl.Struct
consisting of pl.Int64
.