Search code examples
pythonpandasnested-loops

how to pandas fast nested for loop for "non numeric" columns?


how to pandas fast nested for loop for "non numeric" columns? because this for loop is way to slow:

for i in range(len(df1[column_A]):
    for j in range(len(df2[column_A]):
        if df1[column_A][i] == df2[column_A][j]:
            df1[column_B][i] = df2[column_B][j]
        else:
            pass

so any other way to do it by pandas itself or other libraries?

UPDATE:

and main goal is:

input:

df1: 
     name  rpm    power
0   John   1500   high+
1   Mary   1400   high-
2  Sally   300    low-
3    Doe   700    medium-
4   July   1000   medium+


df2: 
     name  age
0  Peter   77
1  Sally   44
2  Micky   22
3  Sally   34
4  July    50
5   Bob    20


required output is:

 but i want it df2:
     name  age  rpm    power
0  Peter   77   0      NA
1  Sally   44   300    low-
2  Micky   22   0      NA
3  Sally   34   300    low-
4   July   50   1000   medium+
5    Bob   20   0      NA

i also add question in official pandas github: https://github.com/pandas-dev/pandas/issues/59824


Solution

  • The nested loop you provided result in O(n^2) complexity, making it slow for larger datasets . Looping over the same range for both i and j, which is unnecessary. Instead you can use pd.merge

    import pandas as pd
    
    # Merge file1 and file2 on column A
    merged_df = pd.merge(file1, file2, on='column_A') # assuming it is a pandas dataframe 
    
    # Update file1_column_B with matched values from file2
    file1_column_B = merged_df['column_B_y']
    
    
    

    pd.merge() function merges two DataFrames ( file1 and file2 ) based on a common column ( column_A ).Also by default, pd.merge() performs an inner merge, which means only rows with matching values in column_A are included in the resulting DataFrame.

    Time complexity of pd.merge is O(n + m) in best case scenario (concluding it) where n is the number of rows in the left DataFrame (file1) and m is the number of rows in the right DataFrame (file2). However, in the worst-case scenario (e.g., when there are many duplicate values in the merge column), the time complexity can be O(n × m).

    You can also use numPy argsort + searchsorted

    import numpy as np
    
    
    sorted = np.argsort(file2['column_A'])
    
    match = np.searchsorted(file2['column_A'][sorted], file1['column_A'])
    
    file1_column_B = file2['column_B'][sorted[match]]
    
    

    Time Complexity of above sort: argsort: O(n log n) searchsorted: O(m log n) Total: O(n log n + m log n)