Tags: python, python-polars

Avoiding double for loops with Polars?


I am trying to use Polars to build a revenue forecast for many products. I have product names, prices, and current revenues based on those prices. Some products' revenues are not a direct multiplication of quantity and price but involve a more complicated function (distributor percentage, etc.), so I have written a separate function for that. I want to simulate 50 different price scenarios and apply them to the existing product portfolio to determine the range of revenues. How can I do this with Polars without using for loops?

Specifically, I want to look up each product name from the main dataframe among the column names of the price dataframe, and create an updated price column in the main dataframe holding the corresponding prices. That will be my first scenario, which I will save as scenario1. I then want to create as many scenarios as there are rows in the prices dataframe. How do I do this in Polars without for loops, please?

Thanks in advance.

I am new to Polars, and in pandas I haven't managed to do this without for loops.
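For reference, the kind of nested-loop approach I am trying to avoid looks roughly like this (a toy sketch with made-up values):

```python
import pandas as pd

main_df = pd.DataFrame({"xxx": ["A", "B"], "price": [100, 150]})
prices_df = pd.DataFrame({"A": [110, 120], "B": [160, 170]})

# One loop over scenario rows, one loop over products -- the
# double for loop that does not scale.
for i in range(len(prices_df)):
    main_df[f"scenario{i}"] = [
        prices_df.loc[i, name] for name in main_df["xxx"]
    ]
print(main_df)
```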


Update:

here is my main_df:

main_df = pl.from_repr("""
┌─────┬───────┐
│ xxx ┆ price │
│ --- ┆ ---   │
│ str ┆ str   │
╞═════╪═══════╡
│ A   ┆ 100   │
│ B   ┆ 150   │
│ C   ┆ 200   │
│ D   ┆ 250   │
│ A   ┆ 230   │
└─────┴───────┘
""")

here is my pixies_df:

pixies_df = pl.from_repr("""
┌─────┬─────┬─────┬─────┐
│ A   ┆ B   ┆ C   ┆ D   │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ 110 ┆ 160 ┆ 210 ┆ 260 │
│ 120 ┆ 170 ┆ 220 ┆ 270 │
│ 130 ┆ 180 ┆ 230 ┆ 280 │
└─────┴─────┴─────┴─────┘
""")

here is my expected output:

shape: (5, 5)
┌─────┬───────┬─────┬─────┬─────┐
│ xxx ┆ price ┆ 0   ┆ 1   ┆ 2   │
│ --- ┆ ---   ┆ --- ┆ --- ┆ --- │
│ str ┆ i64   ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═══════╪═════╪═════╪═════╡
│ A   ┆ 100   ┆ 110 ┆ 120 ┆ 130 │
│ B   ┆ 150   ┆ 160 ┆ 170 ┆ 180 │
│ C   ┆ 200   ┆ 210 ┆ 220 ┆ 230 │
│ D   ┆ 250   ┆ 260 ┆ 270 ┆ 280 │
│ A   ┆ 230   ┆ 110 ┆ 120 ┆ 130 │
└─────┴───────┴─────┴─────┴─────┘
Here is what I have tried so far for transforming pixies_df:

num_rows = pixies_df.height

def transform_column(col_name, col_values):
    return [f"{col_name}-{i}-{col_values[i]}" for i in range(num_rows)]

# Add a row index so the output keeps track of the scenario number
pixies_with_index = pixies_df.with_row_index()

transformed_data = {
    col: transform_column(col, pixies_with_index[col].to_list())
    for col in pixies_with_index.columns
}

pixies_df_transformed = pl.DataFrame(transformed_data)

print(pixies_df_transformed)
shape: (3, 5)
┌───────────┬─────────┬─────────┬─────────┬─────────┐
│ index     ┆ A       ┆ B       ┆ C       ┆ D       │
│ ---       ┆ ---     ┆ ---     ┆ ---     ┆ ---     │
│ str       ┆ str     ┆ str     ┆ str     ┆ str     │
╞═══════════╪═════════╪═════════╪═════════╪═════════╡
│ index-0-0 ┆ A-0-110 ┆ B-0-160 ┆ C-0-210 ┆ D-0-260 │
│ index-1-1 ┆ A-1-120 ┆ B-1-170 ┆ C-1-220 ┆ D-1-270 │
│ index-2-2 ┆ A-2-130 ┆ B-2-180 ┆ C-2-230 ┆ D-2-280 │
└───────────┴─────────┴─────────┴─────────┴─────────┘

After this I want to extract the prices into main_df. Can you help, please?

Also, this is a toy example. The retail HQ might carry more than 100,000 SKUs (products), and the price scenarios can exceed a thousand rows, so a cartesian product built with for loops is not efficient.

I am adding more details here in the main question: I am joining the two dataframes (it crashes the system at the actual row counts, so I used a smaller size) and then using a for loop to apply the revenue function over each subset. I am sure there is a better Polars way to do this? Thanks.

```python
import numpy as np
import polars as pl
from scipy.stats import norm

# Scenario price columns start with 'simprice_'; everything else
# (product name, current price, ...) travels along with each subset
integer_columns = [col for col in aa.columns if col.startswith('simprice_')]
non_integer_columns = [col for col in aa.columns if not col.startswith('simprice_')]

def revenue_function(prices_df):
    # Extract prices
    prices = prices_df['price'].to_numpy()
    # Random adjustment between 90%-110%
    price_adjusted = prices * np.random.uniform(0.9, 1.1, len(prices))

    # Normalize prices and apply the normal CDF
    norm_dist_factor = norm.cdf(price_adjusted / np.mean(prices))
    # Further randomization
    price_to_revenue_ratio = 1 + norm_dist_factor * np.random.uniform(0.8, 1.2)

    revenues = price_adjusted * price_to_revenue_ratio

    # Small random noise
    noise = np.random.normal(0, 0.05, len(prices))
    revenues = revenues * (1 + noise)

    revenues = np.round(revenues)
    return prices_df.with_columns(revenues=pl.Series(revenues))

# Build one subset per scenario column, renamed to 'new_price'
def process_column(col):
    subset = aa.select(non_integer_columns + [pl.col(col)])
    return subset.rename({col: 'new_price'})

subsets = list(map(process_column, integer_columns))

changes = []
for subset in subsets:
    # Calculate revenues and collect the per-scenario total
    updated_subset = revenue_function(subset)
    changes.append(updated_subset['revenues'].sum())
print(changes)
```


Solution

  • main_df.join(
        pixies_df
        .transpose(
            include_header=True,
            header_name="xxx",
            column_names=list(map(str, range(pixies_df.height)))
        ),
        on="xxx"
    )
    
    ┌─────┬───────┬─────┬─────┬─────┐
    │ xxx ┆ price ┆ 0   ┆ 1   ┆ 2   │
    │ --- ┆ ---   ┆ --- ┆ --- ┆ --- │
    │ str ┆ str   ┆ i64 ┆ i64 ┆ i64 │
    ╞═════╪═══════╪═════╪═════╪═════╡
    │ A   ┆ 100   ┆ 110 ┆ 120 ┆ 130 │
    │ B   ┆ 150   ┆ 160 ┆ 170 ┆ 180 │
    │ C   ┆ 200   ┆ 210 ┆ 220 ┆ 230 │
    │ D   ┆ 250   ┆ 260 ┆ 270 ┆ 280 │
    │ A   ┆ 230   ┆ 110 ┆ 120 ┆ 130 │
    └─────┴───────┴─────┴─────┴─────┘
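    If the per-scenario revenue totals are what you ultimately need, you can also avoid looping over scenario columns by unpivoting the joined frame into long format and aggregating per scenario. A minimal sketch, where the doubled-price `revenue` expression is only a stand-in for your actual `revenue_function` logic (on Polars versions before 1.0, `unpivot` is called `melt`):

    ```python
    import polars as pl

    main_df = pl.DataFrame({
        "xxx": ["A", "B", "C", "D", "A"],
        "price": ["100", "150", "200", "250", "230"],
    })
    pixies_df = pl.DataFrame({
        "A": [110, 120, 130],
        "B": [160, 170, 180],
        "C": [210, 220, 230],
        "D": [260, 270, 280],
    })

    joined = main_df.join(
        pixies_df.transpose(
            include_header=True,
            header_name="xxx",
            column_names=list(map(str, range(pixies_df.height))),
        ),
        on="xxx",
    )

    # Long format: one row per (product, scenario) pair
    long = joined.unpivot(
        index=["xxx", "price"],
        variable_name="scenario",
        value_name="new_price",
    )

    # Placeholder revenue: swap in your real pricing logic here
    totals = (
        long.with_columns(revenue=pl.col("new_price") * 2)
            .group_by("scenario")
            .agg(pl.col("revenue").sum())
            .sort("scenario")
    )
    print(totals)
    ```

    Working in long format keeps everything as Polars expressions, so the whole pipeline stays vectorized regardless of how many scenario rows you have.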