Given two files (IP's and Subnet Info), create file that associates each IP to a subnet

I've been struggling for a couple of days with the proper way to address this solution, and I am seeking some assistance.

I have two files and need to create a third that shows the relationship.

IP Address file - ip.csv
Subnet file - subnet.csv

I need to specify what subnet that each IP is in, and create a third file

The ip.csv file will contain about 1.5 million IP's and the subnet.csv file will contain around 140,000 subnets.

ip.csv file sample:

IP,Type
10.78.175.167,IPv4
10.20.3.56,IPv4

subnet.csv file sample:

Subnet,Netmask
10.176.122.136/30,255.255.255.252
10.20.3.0/24,255.255.254.0

Format of file I need to create:

Subnet,IP
10.20.3.0/24,10.20.3.56

I've tried to make use of things from these pages:

This is the code that I have tried. It works on small sets, but I'm having problems running it with the full set of files.

#!/usr/local/bin/python2.7
import csv
import ipaddress
import iptools
import re
import SubnetTree
import sys
from socket import inet_aton

testdir = '/home/test/testdir/'
iprelfile = testdir + 'relationship.csv'
testipsub = testdir + 'subnet.csv'
testipaddr = testdir + 'ip.csv'

o1 = open (iprelfile, "a")

# Subnet file
IPR = set()
o1.write('Subnet,IP\n')
with open(testipsub, 'rb') as master:
    reader = csv.reader(master)
    for row in reader:
        if 'Subnet' not in row[0]:
            # Convert string to unicode to be parsed with ipaddress module
            b = unicode(row[1])
            # Using ipaddress module to create list containing every IP in subnet
            n2 = ipaddress.ip_network(b)
            b1 = (list(n2.hosts()))
            # IP address file
            with open(testipaddr, 'rb') as ipaddy:
                readera = csv.reader(ipaddy)
                for rowa in readera:
                    if 'IP' not in rowa[0]:
                        bb = rowa[0]
                        for ij in b1:
                            # Convert to string for comparison
                            f = str(ij)
                            # If the IP address is in subnet range
                            if f == bb:
                                IPR.update([row[0] + ',' + bb + '\n'])


for ip in IPR:
    o1.write(ip + '\n')

# Closing the file
o1.close()

Solution

You could read all the subnets to memory and sort them by network address. This would allow you to use bisect to do a binary search in order to find the subnet for every IP. This only works if the subnets don't overlap each other, if they do you'll probably need to use segment tree.

import bisect
import csv
import ipaddress

def sanitize(ip):
    parts = ip.split('/', 1)
    parts[0] = '.'.join(str(int(x)) for x in parts[0].split('.'))

    return '/'.join(parts)

with open('subnet.csv') as subnet_f:
    reader = csv.reader(subnet_f)
    next(reader)    # Skip column names

    # Create list of subnets sorted by network address and
    # list of network addresses in the same order
    subnets = sorted((ipaddress.IPv4Network(sanitize(row[0])) for row in reader),
                     key=lambda x: x.network_address)
    network_addrs = [subnet.network_address for subnet in subnets]

with open('ip.csv') as ip_f, open('output.csv', 'w', newline='') as out_f:
    reader = csv.reader(ip_f)
    next(reader)

    writer = csv.writer(out_f)
    writer.writerow(['Subnet', 'IP'])

    for row in reader:
        ip = ipaddress.IPv4Address(sanitize(row[0]))
        index = bisect.bisect(network_addrs, ip) - 1

        if index < 0 or subnets[index].broadcast_address < ip:
            continue    # IP not in range of any networks
        writer.writerow([subnets[index], ip])

Output:

Subnet,IP
10.20.3.0/24,10.20.3.56

Above has time complexity of O(n log m) where n is the number of IPs and m number of networks. Note that it only runs with Python 3 since ipaddress is not included to Python 2.7. If you need to use Python 2.7 there are backports available.

Update The first goal for efficient solution is to find a way to process each individual IP in efficient manner. Looping through all subnets is terribly expensive so it won't do. It's much better to create a sorted list of first IP in each subnet. For given data it would look like this:

[IPv4Address('10.20.3.0'), IPv4Address('10.176.122.136')]

This will allow us to execute binary search in order to find index of IP address that is equal or lower than individual IP. For example when we search for IP 10.20.3.56 we use bisect.bisect to provide us the first index greater than IP and decrement it by one:

>>> l = [IPv4Address('10.20.3.0'), IPv4Address('10.176.122.136')]
>>> index = bisect.bisect(l, IPv4Address('10.20.3.56'))
>>> index
1
>>> l[index - 1]
IPv4Address('10.20.3.0')

Since we have stored the networks to another list which is in the same order we can use index to retrieve given subnet. Once we have the subnet we still need to check that the individual IP is equal or lower than the last IP within the subnet. If the individual IP is within the subnet then write a row to result, if not move to next IP.