Search code examples
performanceaggregate-functionspython-polars

How to use Polars Expressions to improve performance in Polars transformation?


I am not sure the best way to go about using Polars Expressions to speed-up my data analysis pipeline. Mainly, grouping/agg'ing data from one table and using those values in another computation--or reducing the number of columns created to use Expr to achieve the same logic. Here is an overview of what I am trying to do.

Overview:

I have one main function price_range(), which applies multiple Polars transformations. This function takes a parameter data and possibly a recent_prices_data (but this is not necessary for the question) and performs grouping and aggregation transformations on data.

  1. Calculate Standard Deviation: Depending on the _rv_adjustment flag, it calculates the standard deviation of detrended returns either for the same volatility bucket (a previously filtered DF) or the latest window_index for each symbol (which is .min() b/c of the indices are added backwards)
  2. Aggregate RS Data: Aggregates the RS data for each symbol, calculating the last, mean, min, and max of RS to calculate the RS statistics (commented in code)
  3. Join Data: Joins the RS data with the standard deviation of detrended returns and the most recent price data.
  4. Calculate Price Range: Calculates the price range using the specified _rs_method, the standard deviation of detrended returns, and the recent price.
  5. Adjust Prices: Calls _modify_mandelbrot_prices to adjust the top and bottom prices using the calculated price range and the latest cumulative sum values.

The internal function _modify_mandelbrot_prices is used to adjust the price_range_data calculated above.

  1. Aggregate Latest Cumulative Sum Values: For each symbol, it calculates the latest values of cum_sum_max and cum_sum_min columns.
  2. Join Data: It joins price_range_data with the calculated last_cum_sum_max and last_cum_sum_min columns, based on the symbol to bring in the latest cumulative sum values for further calculations.
  3. Calculate Modifiers: For each symbol, it calculates modifiers (top_modifier and bottom_modifier) to adjust the top and bottom prices. This is done to ensure division by zero is handled safely.
  4. Calculate Adjusted Prices: Using the calculated modifiers, it adjusts the top and bottom prices for the Mandelbrot channel.
  5. Select Relevant Columns: Finally, it selects and returns the relevant columns (symbol, bottom_price, recent_price, top_price).

Data:

There are 4 main data objects:

  • data - the main DF with columns needed for further calculations
  • recent_price_data - if None then uses data to pull last "close" column values per symbol
  • std_detrended_returns - an aggregated DF of std's per symbol (created in price_range())
  • price_range_data - an aggregated DF with std_detrended_returns, RS statistics, price_range and recent_price (created in price_range())

Functions:


RS_METHODS = ["RS", "RS_mean", "RS_max", "RS_min"]


def _modify_mandelbrot_prices(
    data: pl.DataFrame | pl.LazyFrame,
    _column_name_cum_sum_max: str,
    _column_name_cum_sum_min: str,
) -> pl.DataFrame | pl.LazyFrame:
    out = (
        data.lazy()
        # Calculate modifiers with safe division to avoid ZeroDivisionError
        .with_columns(
            [
                pl.when(
                    pl.col("last_cum_sum_max") - pl.col("last_cum_sum_min") != 0
                )
                .then(
                    pl.col("last_cum_sum_max")
                    / (pl.col("last_cum_sum_max") - pl.col("last_cum_sum_min"))
                )
                .otherwise(1)
                .alias("top_modifier"),
                pl.when(
                    pl.col("last_cum_sum_max") - pl.col("last_cum_sum_min") != 0
                )
                .then(
                    pl.col("last_cum_sum_min")
                    / (pl.col("last_cum_sum_max") - pl.col("last_cum_sum_min"))
                )
                .otherwise(1)
                .alias("bottom_modifier"),
            ]
        )
        # Calculate top and bottom prices
        .with_columns(
            [
                (
                    pl.col("recent_price")
                    + pl.col("price_range") * pl.col("top_modifier")
                )
                .round(4)
                .alias("top_price"),
                (
                    pl.col("recent_price")
                    + pl.col("price_range") * pl.col("bottom_modifier")
                )
                .round(4)
                .alias("bottom_price"),
            ]
        )
        # Select relevant columns
        .select(
            [
                "symbol",
                "bottom_price",
                "recent_price",
                "top_price",
            ]
        )
    )

    return out


def price_range2(
    data: pl.LazyFrame | pl.DataFrame,
    recent_price_data: pl.DataFrame | pl.LazyFrame | None,
    _rs_method: str = "RS",
    _detrended_returns: str = "detrended_log_returns",  # Parameterized detrended_returns column
    _column_name_cum_sum_max: str = "cum_sum_max",
    _column_name_cum_sum_min: str = "cum_sum_min",
    *,
    _rv_adjustment: bool = False,
    _sort: bool = True,
    **kwargs,
) -> pl.DataFrame | pl.LazyFrame:
    # Check if RS_method is one of the allowed values
    if _rs_method not in RS_METHODS:
        msg = "RS_method must be one of 'RS', 'RS_mean', 'RS_max', 'RS_min'"
        raise HumblDataError(msg)

    sort_cols = ["symbol","date"]
    if _sort:
        data.sort(sort_cols)

    # Define a conditional expression for std_detrended_returns based on _rv_adjustment
    last_cum_sum_max = (
            pl.col(_column_name_cum_sum_max).last().alias("last_cum_sum_max")
        )
    last_cum_sum_min = (
        pl.col(_column_name_cum_sum_min).last().alias("last_cum_sum_min")
    )

    std_detrended_returns_expr = (
        pl.col(_detrended_returns).std().alias(f"std_{_detrended_returns}")
        if _rv_adjustment
        else pl.col(_detrended_returns)
        .filter(pl.col("window_index") == pl.col("window_index").min())
        .std()
        .alias(f"std_{_detrended_returns}")
    )
    if recent_price_data is None:
        # if no recent_prices_data is passed, then pull the most recent prices from the data
        recent_price_expr = pl.col("close").last().alias("recent_price")
        # Perform a single group_by operation to calculate both STD of detrended returns and RS statistics
        price_range_data = (
            data.lazy()
            .group_by("symbol")
            .agg(
                [
                    # Conditional STD calculation based on _rv_adjustment
                    std_detrended_returns_expr,
                    # Recent Price Data
                    recent_price_expr,
                    # cum_sum_max/min last
                    last_cum_sum_max,
                    last_cum_sum_min,
                    # RS statistics
                    pl.col("RS").last().alias("RS"),
                    pl.col("RS").mean().alias("RS_mean"),
                    pl.col("RS").min().alias("RS_min"),
                    pl.col("RS").max().alias("RS_max"),
                ]
            )
            # Join with recent_price_data on symbol
            .with_columns(
                (
                    pl.col(_rs_method)
                    * pl.col("std_detrended_log_returns")
                    * pl.col("recent_price")
                ).alias("price_range")
            )
            .sort("symbol")
        )
    else:
        price_range_data = (
            data.lazy()
            .group_by("symbol")
            .agg(
                [
                    # Conditional STD calculation based on _rv_adjustment
                    std_detrended_returns_expr,
                    # cum_sum_max/min last
                    last_cum_sum_max,
                    last_cum_sum_min,
                    # RS statistics
                    pl.col("RS").last().alias("RS"),
                    pl.col("RS").mean().alias("RS_mean"),
                    pl.col("RS").min().alias("RS_min"),
                    pl.col("RS").max().alias("RS_max"),
                ]
            )
            # Join with recent_price_data on symbol
            .join(recent_price_data.lazy(), on="symbol")
            .with_columns(
                (
                    pl.col(_rs_method)
                    * pl.col("std_detrended_log_returns")
                    * pl.col("recent_price")
                ).alias("price_range")
            )
            .sort("symbol")
        )
    # Relative Position Modifier
    out = _modify_mandelbrot_prices(
        price_range_data,
        "cum_sum_max",
        "cum_sum_min",
    )

    return out


Data:

data = pl.read_csv(
b"""
date,open,high,low,close,volume,symbol,log_returns,window_index,window_mean,detrended_log_returns,cum_sum,cum_sum_min,cum_sum_max,cum_sum_range,cum_sum_std,RS
2024-02-09T00:00:00.000000000,188.65,189.99,188.0,188.85,45155200,AAPL,0.004085644926651533,0,-0.00483920214743363,0.008924847074085163,0.008924847074085163,-0.019153243456081548,0.02866526974133947,0.04781851319742102,0.015173711361409261,3.1514052204153606
2024-02-12T00:00:00.000000000,188.42,188.67,186.79,187.15,41781900,AAPL,-0.009042614808027594,0,-0.00483920214743363,-0.004203412660593963,0.0047214344134912,-0.019153243456081548,0.02866526974133947,0.04781851319742102,0.015173711361409261,3.1514052204153606
2024-02-13T00:00:00.000000000,185.77,186.21,183.51,185.04,56529500,AAPL,-0.01133841642725475,0,-0.00483920214743363,-0.006499214279821119,-0.0017777798663299194,-0.019153243456081548,0.02866526974133947,0.04781851319742102,0.015173711361409261,3.1514052204153606
2024-02-14T00:00:00.000000000,185.32,185.53,182.44,184.15,54630500,AAPL,-0.00482137503210911,0,-0.00483920214743363,0.00001782711532452054,-0.0017599527510053989,-0.019153243456081548,0.02866526974133947,0.04781851319742102,0.015173711361409261,3.1514052204153606
2024-02-15T00:00:00.000000000,183.55,184.49,181.35,183.86,65434500,AAPL,-0.001576044455465997,0,-0.00483920214743363,0.0032631576919676334,0.0015032049409622345,-0.019153243456081548,0.02866526974133947,0.04781851319742102,0.015173711361409261,3.1514052204153606
2024-02-16T00:00:00.000000000,183.42,184.85,181.67,182.31,49701400,AAPL,-0.00846606362023472,0,-0.00483920214743363,-0.0036268614728010903,-0.0021236565318388557,-0.019153243456081548,0.02866526974133947,0.04781851319742102,0.015173711361409261,3.1514052204153606
2024-02-20T00:00:00.000000000,181.79,182.43,180.0,181.56,53665600,AAPL,-0.004122357227108964,0,-0.00483920214743363,0.0007168449203246666,-0.0014068116115141891,-0.019153243456081548,0.02866526974133947,0.04781851319742102,0.015173711361409261,3.1514052204153606
2024-02-21T00:00:00.000000000,181.94,182.89,180.66,182.32,41529700,AAPL,0.004177207349163936,0,-0.00483920214743363,0.009016409496597566,0.007609597885083377,-0.019153243456081548,0.02866526974133947,0.04781851319742102,0.015173711361409261,3.1514052204153606
2024-02-22T00:00:00.000000000,183.48,184.96,182.46,184.37,52292200,AAPL,0.011181223144870245,0,-0.00483920214743363,0.016020425292303873,0.02363002317738725,-0.019153243456081548,0.02866526974133947,0.04781851319742102,0.015173711361409261,3.1514052204153606
2024-02-23T00:00:00.000000000,185.01,185.04,182.23,182.52,45119700,AAPL,-0.01008485202309739,0,-0.00483920214743363,-0.00524564987566376,0.01838437330172349,-0.019153243456081548,0.02866526974133947,0.04781851319742102,0.015173711361409261,3.1514052204153606
2024-02-26T00:00:00.000000000,182.24,182.76,180.65,181.16,40867400,AAPL,-0.007479137371188216,0,-0.00483920214743363,-0.0026399352237545855,0.0157444380779689,-0.019153243456081548,0.02866526974133947,0.04781851319742102,0.015173711361409261,3.1514052204153606
2024-02-27T00:00:00.000000000,181.1,183.92,179.56,182.63,54318900,AAPL,0.008081629515936939,0,-0.00483920214743363,0.01292083166337057,0.02866526974133947,-0.019153243456081548,0.02866526974133947,0.04781851319742102,0.015173711361409261,3.1514052204153606
2024-02-28T00:00:00.000000000,182.51,183.12,180.13,181.42,48953900,AAPL,-0.006647463017152688,0,-0.00483920214743363,-0.001808260869719058,0.02685700887162041,-0.019153243456081548,0.02866526974133947,0.04781851319742102,0.015173711361409261,3.1514052204153606
2024-02-29T00:00:00.000000000,181.27,182.57,179.53,180.75,136682600,AAPL,-0.003699924147923639,0,-0.00483920214743363,0.0011392779995099914,0.0279962868711304,-0.019153243456081548,0.02866526974133947,0.04781851319742102,0.015173711361409261,3.1514052204153606
2024-03-01T00:00:00.000000000,179.55,180.53,177.38,179.66,73488000,AAPL,-0.006048685237813878,0,-0.00483920214743363,-0.0012094830903802475,0.02678680378075015,-0.019153243456081548,0.02866526974133947,0.04781851319742102,0.015173711361409261,3.1514052204153606
2024-03-04T00:00:00.000000000,176.15,176.9,173.79,175.1,81510100,AAPL,-0.025708936509253988,0,-0.00483920214743363,-0.02086973436182036,0.005917069418929791,-0.019153243456081548,0.02866526974133947,0.04781851319742102,0.015173711361409261,3.1514052204153606
2024-03-05T00:00:00.000000000,170.76,172.04,169.62,170.12,95132400,AAPL,-0.028853168906373128,0,-0.00483920214743363,-0.0240139667589395,-0.01809689734000971,-0.019153243456081548,0.02866526974133947,0.04781851319742102,0.015173711361409261,3.1514052204153606
2024-03-06T00:00:00.000000000,171.06,171.24,168.68,169.12,68587700,AAPL,-0.005895548263505468,0,-0.00483920214743363,-0.0010563461160718376,-0.019153243456081548,-0.019153243456081548,0.02866526974133947,0.04781851319742102,0.015173711361409261,3.1514052204153606
2024-03-07T00:00:00.000000000,169.15,170.73,168.49,169.0,71765100,AAPL,-0.0007098071988540511,0,-0.00483920214743363,0.004129394948579579,-0.015023848507501969,-0.019153243456081548,0.02866526974133947,0.04781851319742102,0.015173711361409261,3.1514052204153606
2024-03-08T00:00:00.000000000,169.0,173.7,168.94,170.73,76114600,AAPL,0.010184646360068328,0,-0.00483920214743363,0.015023848507501958,-1.0408340855860843e-17,-0.019153243456081548,0.02866526974133947,0.04781851319742102,0.015173711361409261,3.1514052204153606
2024-02-08T00:00:00.000000000,145.83,146.33,145.1,145.91,22563800,GOOGL,-0.15709336800721108,0,-0.011036959442305195,-0.14605640856490587,-0.14605640856490587,-0.14605640856490587,-5.204170427930421e-17,0.14605640856490582,0.03725400168965407,3.920556233975466
2024-02-09T00:00:00.000000000,146.68,149.44,146.18,149.0,26829500,GOOGL,0.020956312669528465,0,-0.011036959442305195,0.03199327211183366,-0.11406313645307221,-0.14605640856490587,-5.204170427930421e-17,0.14605640856490582,0.03725400168965407,3.920556233975466
2024-02-12T00:00:00.000000000,148.42,149.34,147.37,147.53,21564100,GOOGL,-0.009914761015983764,0,-0.011036959442305195,0.0011221984263214309,-0.11294093802675079,-0.14605640856490587,-5.204170427930421e-17,0.14605640856490582,0.03725400168965407,3.920556233975466
2024-02-13T00:00:00.000000000,144.92,146.67,143.69,145.14,27837700,GOOGL,-0.016332751079485064,0,-0.011036959442305195,-0.005295791637179869,-0.11823672966393066,-0.14605640856490587,-5.204170427930421e-17,0.14605640856490582,0.03725400168965407,3.920556233975466
2024-02-14T00:00:00.000000000,146.08,146.52,144.09,145.94,22704200,GOOGL,0.005496784487483808,0,-0.011036959442305195,0.016533743929789003,-0.10170298573414166,-0.14605640856490587,-5.204170427930421e-17,0.14605640856490582,0.03725400168965407,3.920556233975466
2024-02-15T00:00:00.000000000,143.14,143.52,140.46,142.77,37590700,GOOGL,-0.021960634536345225,0,-0.011036959442305195,-0.01092367509404003,-0.1126266608281817,-0.14605640856490587,-5.204170427930421e-17,0.14605640856490582,0.03725400168965407,3.920556233975466
2024-02-16T00:00:00.000000000,142.99,143.19,140.14,140.52,31451100,GOOGL,-0.015885116403502764,0,-0.011036959442305195,-0.004848156961197569,-0.11747481778937927,-0.14605640856490587,-5.204170427930421e-17,0.14605640856490582,0.03725400168965407,3.920556233975466
2024-02-20T00:00:00.000000000,139.66,142.08,139.49,141.12,25144700,GOOGL,0.004260764860855026,0,-0.011036959442305195,0.015297724303160221,-0.10217709348621905,-0.14605640856490587,-5.204170427930421e-17,0.14605640856490582,0.03725400168965407,3.920556233975466
2024-02-21T00:00:00.000000000,141.45,142.69,140.68,142.55,23315700,GOOGL,0.010082223100200238,0,-0.011036959442305195,0.021119182542505433,-0.08105791094371362,-0.14605640856490587,-5.204170427930421e-17,0.14605640856490582,0.03725400168965407,3.920556233975466
2024-02-22T00:00:00.000000000,144.93,145.0,142.8,144.09,27191900,GOOGL,0.010745288986161228,0,-0.011036959442305195,0.021782248428466423,-0.0592756625152472,-0.14605640856490587,-5.204170427930421e-17,0.14605640856490582,0.03725400168965407,3.920556233975466
2024-02-23T00:00:00.000000000,143.67,144.68,143.43,143.96,19493800,GOOGL,-0.0009026211340126267,0,-0.011036959442305195,0.010134338308292568,-0.049141324206954634,-0.14605640856490587,-5.204170427930421e-17,0.14605640856490582,0.03725400168965407,3.920556233975466
2024-02-26T00:00:00.000000000,142.14,142.44,137.39,137.57,53641800,GOOGL,-0.04540260473792568,0,-0.011036959442305195,-0.03436564529562049,-0.08350696950257512,-0.14605640856490587,-5.204170427930421e-17,0.14605640856490582,0.03725400168965407,3.920556233975466
2024-02-27T00:00:00.000000000,138.02,139.25,137.09,138.88,33099200,GOOGL,0.009477372439135934,0,-0.011036959442305195,0.02051433188144113,-0.062992637621134,-0.14605640856490587,-5.204170427930421e-17,0.14605640856490582,0.03725400168965407,3.920556233975466
2024-02-28T00:00:00.000000000,137.9,138.01,135.41,136.38,37328600,GOOGL,-0.018165143819533824,0,-0.011036959442305195,-0.0071281843772286295,-0.07012082199836263,-0.14605640856490587,-5.204170427930421e-17,0.14605640856490582,0.03725400168965407,3.920556233975466
2024-02-29T00:00:00.000000000,137.28,138.86,136.4,138.46,42133000,GOOGL,0.01513636815737307,0,-0.011036959442305195,0.026173327599678265,-0.04394749439868437,-0.14605640856490587,-5.204170427930421e-17,0.14605640856490582,0.03725400168965407,3.920556233975466
2024-03-01T00:00:00.000000000,138.43,138.87,136.92,137.14,31119500,GOOGL,-0.009579173393661655,0,-0.011036959442305195,0.0014577860486435397,-0.04248970835004083,-0.14605640856490587,-5.204170427930421e-17,0.14605640856490582,0.03725400168965407,3.920556233975466
2024-03-04T00:00:00.000000000,135.66,135.66,131.91,133.35,55999400,GOOGL,-0.028025051228194542,0,-0.011036959442305195,-0.016988091785889347,-0.05947780013593018,-0.14605640856490587,-5.204170427930421e-17,0.14605640856490582,0.03725400168965407,3.920556233975466
2024-03-05T00:00:00.000000000,131.88,133.24,130.67,132.67,40194800,GOOGL,-0.0051124086991976014,0,-0.011036959442305195,0.0059245507431075935,-0.05355324939282259,-0.14605640856490587,-5.204170427930421e-17,0.14605640856490582,0.03725400168965407,3.920556233975466
2024-03-06T00:00:00.000000000,133.12,133.58,130.85,131.4,35318600,GOOGL,-0.009618735878315299,0,-0.011036959442305195,0.0014182235639898962,-0.0521350258288327,-0.14605640856490587,-5.204170427930421e-17,0.14605640856490582,0.03725400168965407,3.920556233975466
2024-03-07T00:00:00.000000000,132.79,134.94,131.61,134.38,37738200,GOOGL,0.0224255014414938,0,-0.011036959442305195,0.03346246088379899,-0.018672564945033707,-0.14605640856490587,-5.204170427930421e-17,0.14605640856490582,0.03725400168965407,3.920556233975466
2024-03-08T00:00:00.000000000,134.21,138.09,134.0,135.41,39343100,GOOGL,0.00763560550272846,0,-0.011036959442305195,0.018672564945033655,-5.204170427930421e-17,-0.14605640856490587,-5.204170427930421e-17,0.14605640856490582,0.03725400168965407,3.920556233975466

"""
)

Improvements:

I have been fiddling with using pl.Expr to speed this up, referring to the last_cum_sum_min column creation/recent_price_data usage, but met with some errors that I am trying to work through.

I think using Expressions to reduce the amount of start-stop logic may help.

Perhaps instead of creating multiple data objects, do all the logic in one execution?

Desired Result:


| symbol | bottom_price | recent_price | top_price |
|--------|--------------|--------------|-----------|
| str    | f64          | f64          | f64       |
| "AAPL" | 168.5382     | 170.73       | 174.0104  |
| "GOOGL"| 115.478      | 135.41       | 135.41    |

Solution

  • What errors are you running into specifically?

    The following seems to work (I've reformatted your code to reduce redundancy – including removing extraneous pl.DataFrame.lazy() calls):

    import polars as pl
    
    
    def _modify_mandelbrot_prices(data: pl.DataFrame) -> pl.DataFrame:
        out = (
            data
            # Calculate modifiers with safe division to avoid ZeroDivisionError
            .with_columns(
                [
                    pl.when(
                        pl.col("last_cum_sum_max") - pl.col("last_cum_sum_min")
                        != 0
                    )
                    .then(
                        pl.col("last_cum_sum_max")
                        / (
                            pl.col("last_cum_sum_max")
                            - pl.col("last_cum_sum_min")
                        )
                    )
                    .otherwise(1)
                    .alias("top_modifier"),
                    pl.when(
                        pl.col("last_cum_sum_max") - pl.col("last_cum_sum_min")
                        != 0
                    )
                    .then(
                        pl.col("last_cum_sum_min")
                        / (
                            pl.col("last_cum_sum_max")
                            - pl.col("last_cum_sum_min")
                        )
                    )
                    .otherwise(1)
                    .alias("bottom_modifier"),
                ]
            )
            # Calculate top and bottom prices
            .with_columns(
                [
                    (
                        pl.col("recent_price")
                        + pl.col("price_range") * pl.col("top_modifier")
                    )
                    .round(4)
                    .alias("top_price"),
                    (
                        pl.col("recent_price")
                        + pl.col("price_range") * pl.col("bottom_modifier")
                    )
                    .round(4)
                    .alias("bottom_price"),
                ]
            )
            .select(["symbol", "bottom_price", "recent_price", "top_price",])
        ).drop_nulls()
    
        return out
    
    
    def price_range(
        data: pl.DataFrame,
        recent_price_data: pl.DataFrame = None,
        _rs_method: str = "RS_mean",
        _detrended_returns: str = "detrended_log_returns",
        _column_name_cum_sum_max: str = "cum_sum_max",
        _column_name_cum_sum_min: str = "cum_sum_min",
        _rv_adjustment: bool = False,
        _sort: bool = True,
    ) -> pl.DataFrame:
    
        if _sort:
            data = data.sort(["symbol", "date"])
    
        std_detrended_returns_expr = (
            pl.col(_detrended_returns).std()
            if not _rv_adjustment
            else pl.col(_detrended_returns)
            .filter(pl.col("window_index") == pl.col("window_index").min())
            .std()
        ).alias(f"std_{_detrended_returns}")
    
        last_cum_sum_max = (
            pl.col(_column_name_cum_sum_max).last().alias("last_cum_sum_max")
        )
        last_cum_sum_min = (
            pl.col(_column_name_cum_sum_min).last().alias("last_cum_sum_min")
        )
    
        rs_expr = pl.col("RS")
        rs_mean_expr = pl.col("RS").mean().alias("RS_mean")
        rs_min_expr = pl.col("RS").min().alias("RS_min")
        rs_max_expr = pl.col("RS").max().alias("RS_max")
    
        if recent_price_data is None:
            recent_price_expr = pl.col("close").last().alias("recent_price")
        else:
            recent_price_expr = pl.col("recent_price")
    
        price_range_expr = (
            pl.col(_rs_method)
            * pl.col(f"std_{_detrended_returns}")
            * pl.col("recent_price")
        ).alias("price_range")
    
        price_range_data = (
            data.lazy()
            .group_by("symbol")
            .agg(
                [
                    std_detrended_returns_expr,
                    last_cum_sum_max,
                    last_cum_sum_min,
                    rs_expr,
                    rs_mean_expr,
                    rs_min_expr,
                    rs_max_expr,
                    recent_price_expr,
                ]
            )
            .with_columns(price_range_expr)
            .sort("symbol")
        )
    
        modified_prices = _modify_mandelbrot_prices(price_range_data)
    
        return modified_prices
    
    
    result = price_range(data).collect()
    print(result)
    
    
    shape: (2, 4)
    ┌────────┬──────────────┬──────────────┬───────────┐
    │ symbol ┆ bottom_price ┆ recent_price ┆ top_price │
    │ ---    ┆ ---          ┆ ---          ┆ ---       │
    │ str    ┆ f64          ┆ f64          ┆ f64       │
    ╞════════╪══════════════╪══════════════╪═══════════╡
    │ AAPL   ┆ 168.5382     ┆ 170.73       ┆ 174.0104  │
    │ GOOGL  ┆ 115.478      ┆ 135.41       ┆ 135.41    │
    └────────┴──────────────┴──────────────┴───────────┘
    

    Scaling given dataset to ~0.5 million rows

    I used the following code to extend the example data you provide by four orders of magnitude (so, from 42 rows to 420,000) and ran the exact same code given above — the polars expressions still completes in a fraction of a second (less than 200ms on my desktop machine):

    import random
    import numpy as np
    
    given_data = data
    
    # Number of times to extend the data
    extension_factor = 10000
    
    # Extend the data by repeating it
    extended_data = given_data
    for _ in range(extension_factor - 1):
        extended_data = extended_data.vstack(given_data)
    
    
    def add_noise(values):
        """Adding noise (up to 10%) to the extended data."""
        noise = np.random.uniform(-0.1, 0.1, len(values)) * values
        return values + noise
    
    
    # Apply noise to numeric columns
    for col in extended_data.columns:
        if (
            extended_data[col].dtype == pl.Int64
            or extended_data[col].dtype == pl.Float64
        ):
            extended_data = extended_data.with_columns(
                add_noise(extended_data[col]).alias(col)
            )
    
    print(extended_data.glimpse(max_items_per_column=1))
    
    Rows: 420000
    Columns: 17
    $ date                  <str> '2024-02-09T00:00:00.000000000'
    $ open                  <f64> 187.2438180278254
    $ high                  <f64> 192.82330116345645
    $ low                   <f64> 192.61383272866942
    $ close                 <f64> 202.8229322724187
    $ volume                <f64> 45188012.34240934
    $ symbol                <str> 'AAPL'
    $ log_returns           <f64> 0.00406418924911009
    $ window_index          <f64> 0.0
    $ window_mean           <f64> -0.005162584160187632
    $ detrended_log_returns <f64> 0.008096107491203208
    $ cum_sum               <f64> 0.009022348695473496
    $ cum_sum_min           <f64> -0.019367091379226153
    $ cum_sum_max           <f64> 0.028760912862026075
    $ cum_sum_range         <f64> 0.044017430587399306
    $ cum_sum_std           <f64> 0.013911469762489888
    $ RS                    <f64> 3.4041538919963745