Tags: python, pandas, dataframe, bigdata, scalability

Optimizing a Pandas DataFrame Transformation to Link Two Columns


Given the following df:

   SequenceNumber | ID | CountNumber | Side | featureA | featureB
0   0             | 0  |  3          | Sell |  4       |  2              
1   0             | 1  |  1          | Buy  |  12      |  45
2   0             | 2  |  1          | Buy  |  1       |  4
3   0             | 3  |  1          | Buy  |  3       |  36
4   1             | 0  |  1          | Sell |  5       |  11
5   1             | 1  |  1          | Sell |  7       |  12
6   1             | 2  |  2          | Buy  |  5       |  35

I want to create a new df such that, for every SequenceNumber value, it takes the rows with CountNumber == 1 and creates new rows where, if Side == 'Buy', their ID goes into a column named To; otherwise their ID goes into a column named From. The column left empty out of From and To then takes the ID of the row with CountNumber > 1 (there is only one such row per SequenceNumber value). The rest of the features should be preserved.

NOTE: basically each SequenceNumber represents one transaction that has either one seller and multiple buyers, or vice versa. I am trying to create a database that links the buyers and sellers, where From is the seller ID and To is the buyer ID.

The output should look like this:

   SequenceNumber | From | To | featureA | featureB
0   0             | 0    |  1 |  12      |  45              
1   0             | 0    |  2 |  1       |  4
2   0             | 0    |  3 |  3       |  36
3   1             | 0    |  2 |  5       |  11
4   1             | 1    |  2 |  7       |  12

I implemented a method that does this, but it uses for loops, which take a long time to run on large data. I am looking for a faster, scalable method. Any suggestions?

Here is the original df:

import pandas as pd

df = pd.DataFrame({'SequenceNumber': [0, 0, 0, 0, 1, 1, 1], 
                   'ID': [0, 1, 2, 3, 0, 1, 2], 
                   'CountNumber': [3, 1, 1, 1, 1, 1, 2],
                   'Side': ['Sell', 'Buy', 'Buy', 'Buy', 'Sell', 'Sell', 'Buy'],
                   'featureA': [4, 12, 1, 3, 5, 7, 5],
                   'featureB': [2, 45, 4, 36, 11, 12, 35]})
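
(For reference, a row-by-row approach of the kind described above, shown here only as an illustrative sketch rather than the actual implementation, might look something like this; the Python-level loop over groups and rows is what makes it slow on large data.)

# illustrative loop-based version (slow): builds the output one row at a time
rows = []
for seq, grp in df.groupby('SequenceNumber'):
    single = grp.loc[grp['CountNumber'] > 1].iloc[0]  # the lone seller (or buyer)
    for _, row in grp.loc[grp['CountNumber'] == 1].iterrows():
        frm, to = (single['ID'], row['ID']) if row['Side'] == 'Buy' else (row['ID'], single['ID'])
        rows.append({'SequenceNumber': seq, 'From': frm, 'To': to,
                     'featureA': row['featureA'], 'featureB': row['featureB']})
slow_df = pd.DataFrame(rows)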

Solution

  • Initial response. This gets the answer about halfway there: split the data into sellers and buyers, then merge it against itself on the sequence number:

    ndf = df.query('Side == "Sell"').merge(
        df.query('Side == "Buy"'), on='SequenceNumber', suffixes=['_sell', '_buy']) \
        .rename(columns={'ID_sell': 'From', 'ID_buy': 'To'})
    

    I then drop the Side columns:

    ndf = ndf.drop(columns=[i for i in ndf.columns if i.startswith('Side')])
    

    This creates a very wide table:

       SequenceNumber  From  CountNumber_sell  featureA_sell  featureB_sell  To  CountNumber_buy  featureA_buy  featureB_buy
    0               0     0                 3              4              2   1                1            12            45
    1               0     0                 3              4              2   2                1             1             4
    2               0     0                 3              4              2   3                1             3            36
    3               1     0                 1              5             11   2                2             5            35
    4               1     1                 1              7             12   2                2             5            35
    

    This, however, leaves you with two sets of featureA and featureB columns (one per side). I don't think your question clearly establishes which one takes precedence; please provide more information on that.

    Should precedence go to the side with the lower CountNumber, or to the side where CountNumber == 1? If the former, see the short sketch after this paragraph. If the latter, just null out those entries before the merge, do the merge, and then forward fill the appropriate columns to recover the proper values.
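
    For the former reading (take the features from whichever side has the lower CountNumber), a short sketch on the wide ndf above could look like the following; note this is an assumption about the intended precedence, not something the question establishes:

    import numpy as np

    # keep each feature from the side whose CountNumber is smaller
    take_sell = ndf['CountNumber_sell'] <= ndf['CountNumber_buy']
    ndf['featureA'] = np.where(take_sell, ndf['featureA_sell'], ndf['featureA_buy'])
    ndf['featureB'] = np.where(take_sell, ndf['featureB_sell'], ndf['featureB_buy'])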


    Re nulling. If you null out the portions of featureA and featureB where CountNumber is not 1, you can then create new versions of those columns after the merge by forward filling and selecting.

    import numpy as np

    # Null out the features on the "many" side (CountNumber != 1) so that,
    # after the merge, the surviving value can be recovered with a forward fill.
    s = df.query('Side == "Sell"').copy()
    s.loc[s['CountNumber'] != 1, ['featureA', 'featureB']] = np.nan
    b = df.query('Side == "Buy"').copy()
    b.loc[b['CountNumber'] != 1, ['featureA', 'featureB']] = np.nan
    
    ndf = s.merge(
        b, on='SequenceNumber', suffixes=['_sell', '_buy']) \
        .rename(columns={'ID_sell': 'From', 'ID_buy': 'To'})
    
    # Forward fill across each buy/sell pair and keep the last column,
    # i.e. the value coming from the CountNumber == 1 side.
    ndf['featureA'] = ndf[['featureA_buy', 'featureA_sell']] \
        .ffill(axis=1).iloc[:, -1]
    ndf['featureB'] = ndf[['featureB_buy', 'featureB_sell']] \
        .ffill(axis=1).iloc[:, -1]
    
    # Drop the Side columns and the suffixed intermediates.
    ndf = ndf.drop(
        columns=[i for i in ndf.columns if i.startswith('Side') 
                 or i.endswith('_sell') or i.endswith('_buy')])
    

    The final version of ndf then is:

       SequenceNumber  From  To  featureA  featureB
    0               0     0   1      12.0      45.0
    1               0     0   2       1.0       4.0
    2               0     0   3       3.0      36.0
    3               1     0   2       5.0      11.0
    4               1     1   2       7.0      12.0
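
    One small follow-up: the NaN placeholders force featureA and featureB to float64 in the result (hence 12.0, 45.0, ...). If the original integer dtype matters, a simple cast after the fill restores it, since every row has recovered a value:

    # safe here: no NaNs remain after the forward fill
    ndf[['featureA', 'featureB']] = ndf[['featureA', 'featureB']].astype(int)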