
Pandas merge dataframes on ip-address by range


I have two dataframes containing some IP information that I'd like to merge (equivalent to a left join in SQL). The dataframes have the following fields:

df1: ["company","ip","actions"]  
df2: ["ip_range_start","ip_range_end","country","state","city"]

The result dataframe should have the headers ["company","ip","actions","country","state","city"]. The problem here is my merge criterion: df1 contains a single IP that I'd like to use to pull the country, state, and city information from df2.

This single IP will fall into one of the ranges specified by df2's "ip_range_start" and "ip_range_end" fields. I'm not sure how to accomplish this, as a normal merge/join obviously won't do the trick since there are no matching values between df1 and df2.

My question seems very similar to this one, but different enough to warrant a separate question: Pandas: how to merge two dataframes on offset dates?


Solution

  • assume you have the following data frames:

    In [5]: df1
    Out[5]:
      company           ip actions
    0   comp1    10.10.1.2    act1
    1   comp2   10.10.2.20    act2
    2   comp3   10.10.3.50    act3
    3   comp4  10.10.4.100    act4
    
    In [6]: df2
    Out[6]:
      ip_range_start ip_range_end   country   state   city
    0      10.10.2.1  10.10.2.254  country2  state2  city2
    1      10.10.3.1  10.10.3.254  country3  state3  city3
    2      10.10.4.1  10.10.4.254  country4  state4  city4
    

    we can create a vectorized function that calculates the numerical IP representation, similar to int(netaddr.IPAddress('192.0.2.1')):

    import numpy as np

    def ip_to_int(ip_ser):
        # split each dotted-quad into its four octets; int64 avoids
        # overflow when shifting the high octet left by 24 bits
        ips = ip_ser.str.split('.', expand=True).astype(np.int64).values
        # per-octet shift amounts: a.b.c.d -> (a<<24)+(b<<16)+(c<<8)+d
        mults = np.tile(np.array([24, 16, 8, 0]), len(ip_ser)).reshape(ips.shape)
        return np.sum(np.left_shift(ips, mults), axis=1)
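As a quick sanity check, the same conversion can be done for a single address with the standard library (`socket` + `struct`), which is handy for verifying the vectorized version against the values shown below:

```python
import socket
import struct

def ip_to_int_scalar(ip):
    # pack the dotted-quad into 4 bytes, then read them back as one
    # big-endian unsigned 32-bit integer
    return struct.unpack('!I', socket.inet_aton(ip))[0]

print(ip_to_int_scalar('10.10.2.1'))  # 168428033, matching df2 below
```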
    

    let's convert all IPs to their numerical representations:

    df1['_ip'] = ip_to_int(df1.ip)
    df2[['_ip_range_start','_ip_range_end']] = df2.filter(like='ip_range').apply(ip_to_int)
    
    In [10]: df1
    Out[10]:
      company           ip actions        _ip
    0   comp1    10.10.1.2    act1  168427778
    1   comp2   10.10.2.20    act2  168428052
    2   comp3   10.10.3.50    act3  168428338
    3   comp4  10.10.4.100    act4  168428644
    
    In [11]: df2
    Out[11]:
      ip_range_start ip_range_end   country   state   city  _ip_range_start  _ip_range_end
    0      10.10.2.1  10.10.2.254  country2  state2  city2        168428033      168428286
    1      10.10.3.1  10.10.3.254  country3  state3  city3        168428289      168428542
    2      10.10.4.1  10.10.4.254  country4  state4  city4        168428545      168428798
    

    now let's add a new column to df1 containing the index of the first matching IP interval from df2 (-1 if there is no match):

    In [12]: df1['x'] = (df1._ip.apply(lambda x: df2.query('_ip_range_start <= @x <= _ip_range_end')
       ....:                                       .index
       ....:                                       .values)
       ....:                   .apply(lambda x: x[0] if len(x) else -1))
    
    In [14]: df1
    Out[14]:
      company           ip actions        _ip  x
    0   comp1    10.10.1.2    act1  168427778 -1
    1   comp2   10.10.2.20    act2  168428052  0
    2   comp3   10.10.3.50    act3  168428338  1
    3   comp4  10.10.4.100    act4  168428644  2
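The per-row `query` lookup above works but scales linearly with both frames. When the ranges don't overlap, `pd.IntervalIndex.get_indexer` can compute the same `x` column in one vectorized call; a minimal sketch using the numeric values from the frames above:

```python
import pandas as pd

# numeric range bounds and IPs copied from the frames above
starts = [168428033, 168428289, 168428545]
ends = [168428286, 168428542, 168428798]
ips = pd.Series([168427778, 168428052, 168428338, 168428644])

# one containing-interval lookup per IP; -1 where nothing matches
intervals = pd.IntervalIndex.from_arrays(starts, ends, closed='both')
x = intervals.get_indexer(ips)
print(list(x))  # [-1, 0, 1, 2]
```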
    

    finally we can merge both DFs:

    In [15]: (pd.merge(df1.drop(columns='_ip'),
       ....:           df2.filter(regex=r'^((?!.?ip_range_).*)$'),
       ....:           left_on='x',
       ....:           right_index=True,
       ....:           how='left')
       ....:    .drop(columns='x')
       ....: )
    Out[15]:
      company           ip actions   country   state   city
    0   comp1    10.10.1.2    act1       NaN     NaN    NaN
    1   comp2   10.10.2.20    act2  country2  state2  city2
    2   comp3   10.10.3.50    act3  country3  state3  city3
    3   comp4  10.10.4.100    act4  country4  state4  city4
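Another way to express a range join like this (not the method used above) is `pd.merge_asof`: match each IP to the nearest range start at or below it, then invalidate rows whose IP overshoots the matched range's end. A small sketch with hypothetical two-row data using the numeric column names from above:

```python
import pandas as pd

df1 = pd.DataFrame({'company': ['comp1', 'comp2'],
                    '_ip': [168427778, 168428052]})
df2 = pd.DataFrame({'_ip_range_start': [168428033],
                    '_ip_range_end': [168428286],
                    'country': ['country2']})

# match on the closest range start <= _ip (both sides must be sorted)
out = pd.merge_asof(df1.sort_values('_ip'),
                    df2.sort_values('_ip_range_start'),
                    left_on='_ip', right_on='_ip_range_start')

# drop matches whose IP lies past the matched range's end
bad = out['_ip'] > out['_ip_range_end']
out.loc[bad, 'country'] = None
print(out['country'].tolist())  # [nan, 'country2']
```

Like the merge above, unmatched rows keep NaN in the location columns.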
    

    Let's compare the speed of the standard int(IPAddress) conversion to our function (we'll use a 4M-row DF for the comparison):

    In [21]: big = pd.concat([df1.ip] * 10**6, ignore_index=True)
    
    In [22]: big.shape
    Out[22]: (4000000,)
    
    In [23]: big.head(10)
    Out[23]:
    0      10.10.1.2
    1     10.10.2.20
    2     10.10.3.50
    3    10.10.4.100
    4      10.10.1.2
    5     10.10.2.20
    6     10.10.3.50
    7    10.10.4.100
    8      10.10.1.2
    9     10.10.2.20
    Name: ip, dtype: object
    
    In [24]: %timeit big.apply(lambda x: int(IPAddress(x)))
    1 loop, best of 3: 1min 3s per loop
    
    In [25]: %timeit ip_to_int(big)
    1 loop, best of 3: 25.4 s per loop
    

    Conclusion: our vectorized function is approx. 2.5 times faster (25.4 s vs. 1 min 3 s).