
Optimizing Variable Combinations to Maximize a Classification


I am working with a dataset where users interact via an app or a website, and I need to determine the optimal combination of variables (x1, x2, ... xn) that will maximize the number of users classified as "APP Lovers." According to the business rule, a user is considered an "APP Lover" if they use the app more than 66% of the time.

Here’s a simplified example of the data structure:

import polars as pl
df = pl.DataFrame({
    "ID": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "variable": ["x1", "x1", "x1", "x2", "x2", "x2", "x3", "x3", "x3"],
    "Favourite": ["APP", "APP", "WEB", "APP", "WEB", "APP", "APP", "APP", "WEB"]
})

In this dataset, each ID represents a user, and variable refers to the function (e.g., x1, x2, x3), with Favourite indicating whether the function was executed via the app or the website.

I pivot the data to count how many actions were performed via APP or WEB:

df2 = (
    df
    .pivot(
        index="ID",
        on="Favourite",
        values="variable",
        aggregate_function="len"
    )
    .fill_null(0)
)

Output:

shape: (3, 3)
┌─────┬─────┬─────┐
│ ID  ┆ APP ┆ WEB │
│ --- ┆ --- ┆ --- │
│ i64 ┆ u32 ┆ u32 │
╞═════╪═════╪═════╡
│ 1   ┆ 3   ┆ 0   │
│ 2   ┆ 2   ┆ 1   │
│ 3   ┆ 1   ┆ 2   │
└─────┴─────┴─────┘

Next, I calculate the proportion of app usage for each user and classify them:

(
    df2
    .with_columns(
        Total = pl.col("APP") + pl.col("WEB")
    )
    .with_columns(
        Proportion = pl.col("APP") / pl.col("Total")
    )
    .with_columns(
        pl
        .when(pl.col("Proportion") >= 0.6).then(pl.lit("APP Lover"))
        .when(pl.col("Proportion") > 0.1).then(pl.lit("BOTH"))
        .otherwise(pl.lit("Inactive"))
    )
)

shape: (3, 6)
┌─────┬─────┬─────┬───────┬────────────┬───────────┐
│ ID  ┆ APP ┆ WEB ┆ Total ┆ Proportion ┆ literal   │
│ --- ┆ --- ┆ --- ┆ ---   ┆ ---        ┆ ---       │
│ i64 ┆ u32 ┆ u32 ┆ u32   ┆ f64        ┆ str       │
╞═════╪═════╪═════╪═══════╪════════════╪═══════════╡
│ 1   ┆ 3   ┆ 0   ┆ 3     ┆ 1.0        ┆ APP Lover │
│ 2   ┆ 2   ┆ 1   ┆ 3     ┆ 0.666667   ┆ APP Lover │
│ 3   ┆ 1   ┆ 2   ┆ 3     ┆ 0.333333   ┆ BOTH      │
└─────┴─────┴─────┴───────┴────────────┴───────────┘

The challenge: In my real dataset, I have at least 19 different x variables. I tried iterating over all possible combinations of these variables to find the ones that result in the highest number of "APP Lovers", but the number of combinations (2^19 = 524,288) is too large to compute efficiently.

Question: How can I efficiently determine the best combination of xn variables that maximizes the number of "APP Lovers"? I'm looking for guidance on how to approach this in terms of algorithmic optimization or more efficient iterations.


Solution

  • Here's my suggestion: take the data:

    df = pl.DataFrame({
        "id": [1, 2, 3, 1, 2, 3, 1, 2, 3],
        "variable": ["x1", "x1", "x1", "x2", "x2", "x2", "x3", "x3", "x3"],
        "favorite": ["APP", "APP", "WEB", "APP", "WEB", "APP", "APP", "APP", "WEB"]
    })
    

    and pivot it such that column xi is true if user id uses that action primarily through the app:

    action_through_app = (
        df
        .with_columns(pl.col.favorite == "APP")
        .pivot(index="id", on="variable", values="favorite")
    )
    

    For example:

    shape: (3, 4)
    ┌─────┬───────┬───────┬───────┐
    │ id  ┆ x1    ┆ x2    ┆ x3    │
    │ --- ┆ ---   ┆ ---   ┆ ---   │
    │ i64 ┆ bool  ┆ bool  ┆ bool  │
    ╞═════╪═══════╪═══════╪═══════╡
    │ 1   ┆ true  ┆ true  ┆ true  │
    │ 2   ┆ true  ┆ false ┆ true  │
    │ 3   ┆ false ┆ true  ┆ false │
    └─────┴───────┴───────┴───────┘
    

    Now, given some combination of variables, we can efficiently count how many users would be app lovers by summing the relevant columns and checking whether each row's sum is >= 0.6 times the number of columns.

    def num_app_lovers(combination):
        return (pl.sum_horizontal(combination) >= 0.6*len(combination)).sum()
    
    action_through_app.select(
        num_app_lovers([pl.col.x1]).alias("x1"),
        num_app_lovers([pl.col.x2]).alias("x2"),
        num_app_lovers([pl.col.x3]).alias("x3"),
        num_app_lovers([pl.col.x1, pl.col.x2]).alias("x12"),
        num_app_lovers([pl.col.x2, pl.col.x3]).alias("x23"),
        num_app_lovers([pl.col.x1, pl.col.x3]).alias("x13"),
        num_app_lovers([pl.col.x1, pl.col.x2, pl.col.x3]).alias("x123"),
    )
    
    shape: (1, 7)
    ┌─────┬─────┬─────┬─────┬─────┬─────┬──────┐
    │ x1  ┆ x2  ┆ x3  ┆ x12 ┆ x23 ┆ x13 ┆ x123 │
    │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ ---  │
    │ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32  │
    ╞═════╪═════╪═════╪═════╪═════╪═════╪══════╡
    │ 2   ┆ 2   ┆ 2   ┆ 1   ┆ 1   ┆ 2   ┆ 2    │
    └─────┴─────┴─────┴─────┴─────┴─────┴──────┘
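
    Rather than writing out each alias by hand, the same bulk query can be generated with `itertools.combinations` — a sketch assuming the `action_through_app` frame from above (workable for a handful of variables, though not for all 2^19 subsets):

    ```python
    from itertools import combinations

    import polars as pl

    df = pl.DataFrame({
        "id": [1, 2, 3, 1, 2, 3, 1, 2, 3],
        "variable": ["x1", "x1", "x1", "x2", "x2", "x2", "x3", "x3", "x3"],
        "favorite": ["APP", "APP", "WEB", "APP", "WEB", "APP", "APP", "APP", "WEB"],
    })

    # One boolean column per variable: true if the user did it via the app.
    action_through_app = (
        df.with_columns(pl.col("favorite") == "APP")
        .pivot(index="id", on="variable", values="favorite")
    )

    def num_app_lovers(combination):
        return (pl.sum_horizontal(combination) >= 0.6 * len(combination)).sum()

    # Build one counting expression per non-empty subset of the variables.
    variables = [c for c in action_through_app.columns if c != "id"]
    counts = action_through_app.select(
        [
            num_app_lovers([pl.col(c) for c in combo]).alias("_".join(combo))
            for r in range(1, len(variables) + 1)
            for combo in combinations(variables, r)
        ]
    )

    # Pick the subset with the highest app-lover count.
    best = max(counts.row(0, named=True).items(), key=lambda kv: kv[1])
    print(best)
    ```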
    

    This lets you query combinations in bulk, but it still doesn't scale to all 2^19 possible combinations. For that problem I'd suggest using evolutionary programming.

    Initialize a pool of candidate combinations with x1, x2, ..., xn. Then, for each combination in the pool, randomly add or remove a column (removing only if the combination has more than one column) and score the mutated candidates with the query above. Keep the top, say, 100 combinations, and repeat until the best score stops improving; then return the best combination found.
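
    The loop described above could be sketched like this — a minimal, illustrative version (names such as `mutate` and `evolve`, and the `pool_size`/`patience` parameters, are my own, not from any library), again assuming the boolean `action_through_app` frame:

    ```python
    import random

    import polars as pl

    df = pl.DataFrame({
        "id": [1, 2, 3, 1, 2, 3, 1, 2, 3],
        "variable": ["x1", "x1", "x1", "x2", "x2", "x2", "x3", "x3", "x3"],
        "favorite": ["APP", "APP", "WEB", "APP", "WEB", "APP", "APP", "APP", "WEB"],
    })
    action_through_app = (
        df.with_columns(pl.col("favorite") == "APP")
        .pivot(index="id", on="variable", values="favorite")
    )

    def num_app_lovers(df, combination):
        cols = [pl.col(c) for c in combination]
        return df.select((pl.sum_horizontal(cols) >= 0.6 * len(cols)).sum()).item()

    def mutate(combination, variables):
        """Randomly add or remove one variable (never dropping below one)."""
        combo = set(combination)
        candidate = random.choice(variables)
        if candidate in combo and len(combo) > 1:
            combo.remove(candidate)
        else:
            combo.add(candidate)
        return frozenset(combo)

    def evolve(df, variables, pool_size=100, patience=10):
        pool = {frozenset([v]) for v in variables}
        best_combo, best_score, stale = None, -1, 0
        while stale < patience:
            # Grow the pool with one random mutation of each member ...
            pool |= {mutate(c, variables) for c in pool}
            # ... then keep only the highest-scoring combinations.
            ranked = sorted(pool, key=lambda c: num_app_lovers(df, c), reverse=True)
            pool = set(ranked[:pool_size])
            top_score = num_app_lovers(df, ranked[0])
            if top_score > best_score:
                best_combo, best_score, stale = ranked[0], top_score, 0
            else:
                stale += 1
        return sorted(best_combo), best_score

    random.seed(0)  # make the toy run deterministic
    best_combo, best_score = evolve(action_through_app, ["x1", "x2", "x3"],
                                    pool_size=4, patience=3)
    print(best_combo, best_score)
    ```

    This scores each combination with a separate `select`; for a real run you would likely batch the scoring (as in the bulk query above) and tune `pool_size` and `patience`. Like any evolutionary search, it returns a good combination, not a guaranteed optimum.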