Search code examples
pythonpandasdataframeconditional-statementsnested-loops

Pandas: Loop within Data frame Rows and return the matched result


Below is a Data frame with two set of values A & B in column 'ID'. Need to find the smallest value from previous rows in the same columns 'MT' & 'Price', if there is a greater price value than the current price value, the loop should break and perform this condition for the following rows. The matched condition values are required to be populated in different columns, as sample output is shown below,

Python
df = {'ID': ['A', 'A', 'A', 'A','A','A','B', 'B', 'B', 'B','B','B'],
        'Date': ['01-05-2023','02-05-2023', '03-05-2023', '04-05-2023', '06-05-2023','07-05-2023','01-05-2023','02-05-2023', '03-05-2023', '04-05-2023', '06-05-2023','07-05-2023'],
      'MT': [2, 5, 10, 15, 20, 30,2, 5, 10, 15, 20, 30],
        'Price': [300, 100.5, 200, 150.35, 250, 90,300, 100.5, 200, 150.35, 250, 90,]}

df = pd.DataFrame(data)

print(df)

Available Data frame:
*******************
   ID        Date  MT   Price
0   A  01-05-2023   2  300.00
1   A  02-05-2023   5  100.50
2   A  03-05-2023  10  200.00
3   A  04-05-2023  15  150.35
4   A  06-05-2023  20  250.00
5   A  07-05-2023  30   90.00
6   B  01-05-2023   2  300.00
7   B  02-05-2023   5  100.50
8   B  03-05-2023  10  200.00
9   B  04-05-2023  15  150.35
10  B  06-05-2023  20  250.00
11  B  07-05-2023  30   90.00

Output Required:
*******************
    ID  Date        MT  Price       Matched_Price       Date_Values
0   A   01-05-2023  2   300         
1   A   02-05-2023  5   100.5           
2   A   03-05-2023  10  200        100.5                02-05-2023
3   A   04-05-2023  15  150.35          
4   A   06-05-2023  20  250     100.5,200,150.35        02-05-2023,03-05-2023,04-05-2023
5   A   07-05-2023  30  90          
6   B   01-05-2023  2   300         
7   B   02-05-2023  5   100.5                       
8   B   03-05-2023  10  200         100.5               02-05-2023
9   B   04-05-2023  15  150.35          
10  B   06-05-2023  20  250         100.5,200,150.35    02-05-2023,03- 
 05-2023,04-05-2023     
11  B   07-05-2023  30  90  

The below provided code works fine in identifying all the smaller values in the column, but the second part of the condition i.e) if there is a greater price value than the current price value, the loop should break and perform this condition for the following rows, needs to be applied.

for i, row in df.iterrows(): dfa = df.iloc[:i, :] # slice with rows above current dfa = dfa[(dfa.ID==row.ID) & (dfa.MT < row.MT) & (dfa.Price< row.Price)] # matched rows df.loc[i, 'matched_Price'] = ','.join(map(str, dfa.Price)) df.loc[i, 'matched_dates'] = ','.join(map(str, dfa.Date))


Solution

  • I usually prefer to do these things within the .apply() function rather than in a loop. It's not wrong to use a loop, but I find it can make things harder to deal with in pandas world. It is also usually slower than using apply or other better vectorized solutions.

    Here is what I came up with:

    import pandas as pd
    
    df = {'ID': ['A', 'A', 'A', 'A','A','A','B', 'B', 'B', 'B','B','B'],
            'Date': ['01-05-2023','02-05-2023', '03-05-2023', '04-05-2023', '06-05-2023','07-05-2023','01-05-2023','02-05-2023', '03-05-2023', '04-05-2023', '06-05-2023','07-05-2023'],
          'MT': [2, 5, 10, 15, 20, 30,2, 5, 10, 15, 20, 30],
            'Price': [300, 100.5, 200, 150.35, 250, 90,300, 100.5, 200, 150.35, 250, 90,]}
    
    df = pd.DataFrame(df)
    
    # Store in a copy so we can manipulate it and maintain the source
    mod_df = df.copy()
    
    # Shift gives us a series "pushed" along one entry. Allows comparison with the previous entry
    mod_df["shift"] = df["Price"].shift()
    # diff is where we will evaluate the dataframe. Where this diff is positive within
    # a product ID is the rows you want to focus on.
    mod_df["diff"] = mod_df["Price"] - mod_df["shift"]
    
    
    # Function for doing the heavy lifting
    def pull_values_w_price_less_than(row: pd.Series, column: str):
        # The case that the price decreased or stayed the same
        if row["diff"] <= 0:
            return pd.NA
        # Price increased (where we want to eval)
        else:
            # Entries relevant to this row
            relevant_df = mod_df[
                # Product ID is the same
                (mod_df["ID"] == row["ID"])
                # Price is less than this one
                & (mod_df["Price"] < row["Price"])
                # And it happened before this one
                & (mod_df["Date"] < row["Date"])
            ]
    
            # The matched values (be it Price or Date) determined by the passed arg
            # column placed in a list
            matched_values = relevant_df[column].to_list()
    
            # If nothing was matched (this happens at ID boundaries where the price
            # went down so we are in this region, but we don't want to return an
            # empty list)
            if len(matched_values) == 0:
                return pd.NA
            # Return the values to put in the row
            return matched_values
    
    # Do this for the prices
    mod_df["Matched_Price"] = mod_df.apply(
        pull_values_w_price_less_than, args=("Price",), axis=1
    )
    # And the dates
    mod_df["Date_Values"] = mod_df.apply(
        pull_values_w_price_less_than, args=("Date",), axis=1
    )
    
    # Remove our helper columns
    output = mod_df.drop(columns=["shift", "diff"])
    
    # View the df (optional ofc)
    output
    

    I also put the values into a list rather than a string separated by commas, but you can easily convert those columns using this function:

    def list_to_comma_string(value_list) -> str:
        # If it was already null we cant change that
        if value_list is pd.NA:
            return pd.NA
        
        # list comprehension to change all the values to strings in the list
        joiner = [str(x) for x in value_list]
    
        # The most efficient way to combine strings in python
        return ','.join(joiner)
    
    # Convert both columns
    output['Matched_Price'] = output['Matched_Price'].apply(list_to_comma_string)
    output['Date_Values'] = output['Date_Values'].apply(list_to_comma_string)
    
    # Again view the df
    output
    

    Hope this helps!