python, pandas, python-itertools

Count by combination


I have a dataset showing which cities each vehicle has been to (as shown in df1 below).

I'm trying to create a list of two-city combinations based on df1, and then count how many vehicles have been to each combination (like df2 below).

I dug around but couldn't find a solution. Does anyone know how to do this? (Any help will be appreciated.)

df1 = pd.DataFrame([
    [1, 'A'], [1, 'B'], [1, 'C'],
    [2, 'A'], [2, 'C'], [2, 'C'], [2, 'A'],
    [3, 'C'], [3, 'B'], [3, 'C'], [3, 'B']], columns=['Vehicle_ID', 'City'])

df2 = pd.DataFrame([['A,B', 1], ['B,C', 2], ['A,C', 2]],
                   columns=['City_Combination', 'Vehicle_Count'])

Note:

(1) Order of cities visited doesn't matter. E.g., under the ('A,B') combination, vehicles that visited (A -> B), (B -> A), or (A -> C -> B) are all counted.

(2) Frequency of visits doesn't matter. E.g., under the ('A,B') combination, a vehicle that visited (A -> B -> A -> A) is still counted as 1 vehicle.


Solution

  • Here are two options. The first is to group by Vehicle_ID and, for each group, generate all two-city combinations, sorting each pair so that ('B', 'A') and ('A', 'B') count as the same combination. Collect the resulting (city, city, Vehicle_ID) tuples in a set (since we don't care about repeated city pairs), then build a new DataFrame from the set, group by the city pairs, and count the distinct Vehicle_IDs. Calling df1.drop_duplicates() first collapses repeated (Vehicle_ID, City) rows, which takes care of note (2):

    import itertools as IT

    df1 = df1.drop_duplicates()
    data = set()
    for vid, grp in df1.groupby('Vehicle_ID'):
        for c1, c2 in IT.combinations(grp['City'], 2):
            if c1 > c2:
                c1, c2 = c2, c1   # sort the pair so (B, A) and (A, B) match
            data.add((c1, c2, vid))
    # The third column holds Vehicle_IDs at this point; naming it
    # Vehicle_Count makes the label correct after the groupby/count below.
    df = pd.DataFrame(list(data), columns=['City_x', 'City_y', 'Vehicle_Count'])
    #   City_x City_y  Vehicle_Count
    # 0      B      C              3
    # 1      A      C              1
    # 2      B      C              1
    # 3      A      C              2
    # 4      A      B              1
    # (set iteration order is arbitrary, so your row order may differ)
    result = df.groupby(['City_x', 'City_y']).count()
    

    yields

                   Vehicle_Count
    City_x City_y               
    A      B                   1
           C                   2
    B      C                   2
    
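    If you want the exact df2 layout from the question (a single 'A,B'-style City_Combination column), one way is to flatten the MultiIndex afterwards, e.g.:

    result2 = result.reset_index()
    result2['City_Combination'] = result2['City_x'] + ',' + result2['City_y']
    result2 = result2[['City_Combination', 'Vehicle_Count']]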

    An alternative way is to merge df1 with itself:

    In [244]: df1 = df1.drop_duplicates()
    
    In [246]: df3 = pd.merge(df1, df1, on='Vehicle_ID', how='left'); df3
    Out[246]: 
        Vehicle_ID City_x City_y
    0            1      A      A
    1            1      A      B
    2            1      A      C
    3            1      B      A
    4            1      B      B
    5            1      B      C
    6            1      C      A
    7            1      C      B
    8            1      C      C
    9            2      A      A
    10           2      A      C
    11           2      C      A
    12           2      C      C
    13           3      C      C
    14           3      C      B
    15           3      B      C
    16           3      B      B
    

    Unfortunately for us, pd.merge generates the full Cartesian product of city pairs, so we need to remove the rows where City_x >= City_y. This drops the self-pairs (City_x == City_y) and keeps exactly one ordering of each remaining pair:

    In [247]: mask = df3['City_x'] < df3['City_y']
    In [248]: df3 = df3.loc[mask]; df3
    Out[248]: 
        Vehicle_ID City_x City_y
    1            1      A      B
    2            1      A      C
    5            1      B      C
    10           2      A      C
    15           3      B      C
    

    And now we can once again group by City_x and City_y and count the result:

    In [251]: result = df3.groupby(['City_x', 'City_y']).count(); result
    Out[251]: 
                   Vehicle_ID
    City_x City_y            
    A      B                1
           C                2
    B      C                2
    
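    Note that with this method the counts land in a column still named Vehicle_ID. To match the first method's output, rename it (as the using_merge function below does):

    result = result.rename(columns={'Vehicle_ID': 'Vehicle_Count'})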

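    Putting it all together, here is a complete script that packages both approaches as functions and checks that they agree on a larger random DataFrame: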
    import numpy as np
    import pandas as pd
    import itertools as IT
    
    def using_iteration(df1):
        df1 = df1.drop_duplicates()
        data = set()
        for vid, grp in df1.groupby('Vehicle_ID'):
            for c1, c2 in IT.combinations(grp['City'], 2):
                if c1 > c2:
                    c1, c2 = c2, c1
                data.add((c1, c2, vid))
        df = pd.DataFrame(list(data), columns=['City_x', 'City_y', 'Vehicle_Count'])
        result = df.groupby(['City_x', 'City_y']).count()
        return result
    
    def using_merge(df1):
        df1 = df1.drop_duplicates()
        df3 = pd.merge(df1, df1, on='Vehicle_ID', how='left')
        mask = df3['City_x'] < df3['City_y']
        df3 = df3.loc[mask]
        result = df3.groupby(['City_x', 'City_y']).count()
        result = result.rename(columns={'Vehicle_ID':'Vehicle_Count'})
        return result
    
    def generate_df(nrows, nids, strlen):
        # Random city names of length strlen drawn from the letters A-D,
        # giving up to 4**strlen distinct cities.
        letters = np.random.choice(list('ABCD'), size=(nrows, strlen))
        cities = [''.join(row) for row in letters]
        ids = np.random.randint(nids, size=nrows)
        return pd.DataFrame({'Vehicle_ID': ids, 'City': cities})
    
    df1 = pd.DataFrame([
        [1, 'A'], [1, 'B'], [1, 'C'],
        [2, 'A'], [2, 'C'], [2, 'C'], [2, 'A'],
        [3, 'C'], [3, 'B'], [3, 'C'], [3, 'B']], columns=['Vehicle_ID', 'City'])
    
    df = generate_df(10000, 50, 2)
    assert using_merge(df).equals(using_iteration(df))
    

    If df1 is small, using_iteration may be faster than using_merge. For example, with the df1 from the question above:

    In [261]: %timeit using_iteration(df1)
    100 loops, best of 3: 3.45 ms per loop
    
    In [262]: %timeit using_merge(df1)
    100 loops, best of 3: 4.39 ms per loop
    

    However, if we generate a DataFrame with 10000 rows, 50 Vehicle_IDs, and 16 possible cities (strlen=2 over the four letters), then using_merge may be faster than using_iteration:

    df = generate_df(10000, 50, 2)
    
    In [241]: %timeit using_merge(df)
    100 loops, best of 3: 7.73 ms per loop
    
    In [242]: %timeit using_iteration(df)
    100 loops, best of 3: 16.3 ms per loop
    

    Generally speaking, the more iterations the for-loops in using_iteration must perform (i.e., the more Vehicle_IDs and possible city pairs there are), the more likely NumPy- or pandas-based methods such as pd.merge are to be faster, since the pair generation then happens in compiled code rather than in a Python-level loop.

    Note, however, that pd.merge generates a bigger DataFrame than we ultimately need (the full product, including the rows we mask out afterwards), so using_merge may require more memory than using_iteration. For sufficiently large df1s it may even need swap space, which can make using_merge slower than using_iteration.

    So it is best to test using_iteration and using_merge (and other solutions) on your actual data to see what is fastest.
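    For a quick comparison outside IPython, a minimal timing harness along these lines (a sketch using only the standard-library timeit module) will do:

    import timeit

    df_big = generate_df(10000, 50, 2)
    for fn in (using_iteration, using_merge):
        # Best of 3 repeats of 10 calls each, reported per call.
        secs = min(timeit.repeat(lambda: fn(df_big), number=10, repeat=3)) / 10
        print('{}: {:.2f} ms per call'.format(fn.__name__, secs * 1000))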