Tags: python, pandas, dataframe, optimization, sampling

Efficient random subsample of a Pandas DataFrame based on target requirements for multiple columns


I am currently working on a project where I have a large DataFrame (500'000 rows) whose rows are polygons, each representing a geographical area. The columns of the DataFrame represent different landcover classes (34 classes), and the cell values are the area covered by each landcover class in square kilometers.

My objective is to subsample this DataFrame based on target requirements for landcover classes. Specifically, I want to select a subset of polygons that collectively meet certain target coverage requirements for each landcover class. The target requirements are specified as the desired area coverage for each landcover class.

A colleague hinted that this could be framed as an optimisation problem with an objective function. However, I have not found a solution along those lines yet and instead tried a different, slow, iterative approach (see below).

To give you a better understanding, here is a minimal reproducible example of my DataFrame structure with only 4 polygons and 3 classes:

import pandas as pd

# Create a sample DataFrame
data = {
    'Polygon': ['Polygon A', 'Polygon B', 'Polygon C', 'Polygon D'],
    'Landcover 1': [10, 5, 7, 3],
    'Landcover 2': [15, 8, 4, 6],
    'Landcover 3': [20, 12, 9, 14]
}

df = pd.DataFrame(data)

For instance, let's say I have the following target requirements for landcover classes:

target_requirements = {
    'Landcover 1': 15,
    'Landcover 2': 20,
    'Landcover 3': 25
}

Based on these target requirements, I would like to subsample the DataFrame by selecting a subset of polygons that collectively meets or closely approximates the target area coverage for each landcover class. In this example, polygons A and C form a good subsample, as their landcover coverages summed together come close to the requirements I set.
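
As a quick sanity check (a minimal sketch using the sample df and target_requirements above), summing the landcover columns for polygons A and C gives 17, 19 and 29 against targets of 15, 20 and 25:

landcover_columns = ['Landcover 1', 'Landcover 2', 'Landcover 3']

# Total area per landcover class for the candidate subsample {A, C}
subsample_total = df.loc[df['Polygon'].isin(['Polygon A', 'Polygon C']), landcover_columns].sum()
print(subsample_total)
# Landcover 1    17
# Landcover 2    19
# Landcover 3    29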

My [extended] code so far

Here is what I have coded so far. You will see a few extra steps implemented along the way:

  1. Weights: to guide the selection of polygons using deficits and surpluses.
  2. Random sampling of the top 0.5%: based on the weights, I select the top 0.5% of polygons and randomly pick one from this selection.
  3. Tolerance: I set a tolerance for discrepancies between the cumulative areas of the current subsample and the required targets.
  4. Progress bar: purely aesthetic.

import numpy as np
import pandas as pd
from tqdm import tqdm

def select_polygons(row, cumulative_coverages, landcover_columns, target_coverages):
    selected_polygon = row[landcover_columns]

    # Add the selected polygon to the subsample
    subsample = selected_polygon.to_frame().T
    cumulative_coverages += selected_polygon.values

    return cumulative_coverages, subsample

df_data = # Your DataFrame with polygons and landcover classes
landcover_columns = # List of landcover columns in the DataFrame
target_coverages = # Dictionary of target coverages for each landcover class

total_coverages = df_data[landcover_columns].sum()
target_coverages = pd.Series(target_coverages, landcover_columns)

df_data = df_data.sample(frac=1).dropna().reset_index(drop=True)

# Set parameters for convergence
max_iterations = 30000
convergence_threshold = 0.1
top_percentage = 0.005

# Initialize variables
subsample = pd.DataFrame(columns=landcover_columns)
cumulative_coverages = pd.Series(0, index=landcover_columns)

# Initialize tqdm progress bar
progress_bar = tqdm(total=max_iterations)

# Iterate until the cumulative coverage matches or is close to the target coverage
for iteration in range(max_iterations):
    remaining_diff = target_coverages - cumulative_coverages
    deficit = remaining_diff.clip(lower=0)
    surplus = remaining_diff.clip(upper=0) * 0.1
    deficit_sum = deficit.sum()

    normalized_weights = deficit / deficit_sum

    # Calculate the combined weights for deficit and surplus for the entire dataset
    weights = df_data[landcover_columns].mul(normalized_weights) + surplus

    # Calculate the weight sum for each polygon
    weight_sum = weights.sum(axis=1)

    # Select the top 0.5% of polygons based on weight sum
    top_percentile = int(len(df_data) * top_percentage)
    top_indices = weight_sum.nlargest(top_percentile).index
    selected_polygon_index = np.random.choice(top_indices)

    selected_polygon = df_data.loc[selected_polygon_index]

    cumulative_coverages, subsample_iteration = select_polygons(
        selected_polygon, cumulative_coverages, landcover_columns, target_coverages
    )

    # Add the selected polygon to the subsample (DataFrame.append was removed in pandas 2.0, so use pd.concat)
    subsample = pd.concat([subsample, subsample_iteration])

    df_data = df_data.drop(selected_polygon_index)

    # Check if all polygons have been selected or the cumulative coverage matches or is close to the target coverage
    if df_data.empty or np.allclose(cumulative_coverages, target_coverages, rtol=convergence_threshold):
        break

    # Calculate the percentage of coverage achieved
    coverage_percentage = (cumulative_coverages.sum() / target_coverages.sum()) * 100

    # Update tqdm progress bar
    progress_bar.set_description(f"Iteration {iteration+1}: Coverage Percentage: {coverage_percentage:.2f}%")
    progress_bar.update(1)

progress_bar.close()

subsample.reset_index(drop=True, inplace=True)

The problem

The code is slow (~10 iterations/s) and doesn't handle the tolerance well, i.e. I can end up with cumulative_coverages way above 100% while the tolerance is still not met (my "guidance for selection" is not good enough). Plus, there must be a much better OPTIMISATION to get what I want.

Any help/idea would be appreciated.


Solution

  • The problem description lends itself to solving with a Mixed-Integer-Linear Program (MIP).

    A convenient library for solving mixed-integer problems is PuLP, which ships with the Coin-OR suite built in, in particular the integer solver CBC.

    Note that I changed the data structure from the DataFrame in the example to a nested dict, because the code gets really messy with your original DataFrame: lookups without an index need a lot of .loc, which I personally dislike :D

    data = {
        'Polygon A': {
            'Landcover 1': 10,
            'Landcover 2': 15,
            'Landcover 3': 20
        },
        'Polygon B': {
            'Landcover 1': 5,
            'Landcover 2': 8,
            'Landcover 3': 12
        },
        'Polygon C': {
            'Landcover 1': 7,
            'Landcover 2': 4,
            'Landcover 3': 9
        },
        'Polygon D': {
            'Landcover 1': 3,
            'Landcover 2': 6,
            'Landcover 3': 14
        }
    }
    
    target_requirements = {
        'Landcover 1': 15,
        'Landcover 2': 20,
        'Landcover 3': 25
    }
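
    If you start from the original DataFrame rather than typing the dict by hand, the same nested structure can be derived directly (a sketch assuming the df and column names from the question, with 'Polygon' as the identifier column):

    # Build {polygon: {landcover: area}} from the question's DataFrame
    landcover_columns = ['Landcover 1', 'Landcover 2', 'Landcover 3']
    data = df.set_index('Polygon')[landcover_columns].to_dict(orient='index')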
    

    Then we formulate the linear problem:

    from pulp import LpProblem, LpVariable, LpMinimize, lpSum, value, PULP_CBC_CMD
    
    
    
    model = LpProblem("LandcoverOptimization", LpMinimize)
    
    # Create a binary variable for each polygon to indicate whether it was selected in the solution
    polygons = list(data.keys())
    selected = LpVariable.dicts("Selected", polygons, cat='Binary')
    
    # Minimize the amount of selected polygons
    model += lpSum(selected[polygon] for polygon in polygons)
    
    
    # We need to approximately cover the target required landcover +- tolerance
    # Also: We have to select a polygon to add it to the solution, this will be minimized by the objective
    tolerance = 0.1
    # We create the sums only once for performance
    sums = {}
    for landcover in target_requirements:
        sums[landcover] = lpSum(data[polygon][landcover] * selected[polygon] for polygon in polygons)
    for landcover in target_requirements:
        model += sums[landcover] >= target_requirements[landcover] * (1 - tolerance)
        model += sums[landcover] <= target_requirements[landcover] * (1 + tolerance)
    
    model.solve(PULP_CBC_CMD(fracGap=0.9))  # Set the MIP gap very permissively for performance
    

    To visualize the output we extract the solution:

    selected_polygons = [polygon for polygon in polygons if value(selected[polygon]) == 1.0]
    print("Selected Polygons:")
    for polygon in selected_polygons:
        print(polygon)
    
    achieved_landcovers = {}
    for landcover in target_requirements:
        total_landcover = sum(data[polygon][landcover] for polygon in selected_polygons)
        achieved_landcovers[landcover] = total_landcover
    
    print("Required Landcovers:")
    for landcover, requirement in target_requirements.items():
        print(f"{landcover}: {requirement}")
    
    print("Achieved Landcovers:")
    for landcover, coverage in achieved_landcovers.items():
        print(f"{landcover}: {coverage}")
    

    Output:

    Selected Polygons:
    Polygon A
    Polygon C
    Required Landcovers:
    Landcover 1: 15
    Landcover 2: 20
    Landcover 3: 25
    Achieved Landcovers:
    Landcover 1: 17
    Landcover 2: 19
    Landcover 3: 29
    

    With regard to tractability:

    I've solved some MIP models with ~3 million binary variables; this question says 5 million variables won't solve for them. Playing with the tolerances, as well as the optimality gap, will let you dial into a useful and practical solution for your problem. The solver output also logs measures of optimality and/or proves infeasibility of the problem (e.g. if you set your tolerances too tightly):

    Result - Optimal solution found
    
    Objective value:                2.00000000
    Enumerated nodes:               0
    Total iterations:               1
    Time (CPU seconds):             0.00
    Time (Wallclock seconds):       0.00
    
    Option for printingOptions changed from normal to all
    Total time (CPU seconds):       0.00   (Wallclock seconds):       0.01
    
    

    If the amount of data is truly huge and the tolerances need to be very tight, CBC might no longer be able to handle the problem. Commercially available solvers such as Gurobi will handle much larger problem instances, but aren't available for free.
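
    If a Gurobi installation and licence happen to be available, PuLP can pass the very same model to it without reformulating anything (a sketch; GUROBI_CMD assumes the gurobi_cl command-line tool is on your PATH):

    from pulp import GUROBI_CMD

    # Identical model object as above, only the backend solver changes
    model.solve(GUROBI_CMD(msg=True))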

    In my experience, if a solver can't find a satisfactory solution, there is no way you can write some python/pandas code that will.

    To get into the mindset of describing problems rather than hand-coding solutions, I recommend Raymond Hettinger's excellent PyCon 2019 talk, Modern solvers: Problems well-defined are problems solved.

    To install PuLP:

    pip install pulp
    

    EDIT: For OP's 88MB problem instance on my MacBook Pro:

    Result - Optimal solution found (within gap tolerance)
    
    Objective value:                14178.00000000
    Lower bound:                    14080.357
    Gap:                            0.01
    Enumerated nodes:               0
    Total iterations:               0
    Time (CPU seconds):             218.06
    Time (Wallclock seconds):       219.21
    
    Option for printingOptions changed from normal to all
    Total time (CPU seconds):       293.59   (Wallclock seconds):       296.26
    

    Tweaking the tolerance and MIP gap will improve runtime or solution quality.
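
    For instance, both knobs can be adjusted without touching the rest of the model (a sketch; gapRel is the keyword in current PuLP releases, older versions use fracGap, and the timeLimit value is purely illustrative):

    # The coverage tolerance enters where the constraints are built, e.g. tolerance = 0.03 for +-3%;
    # the optimality gap and an optional time limit are passed to the solver:
    model.solve(PULP_CBC_CMD(
        gapRel=0.01,    # accept any solution proven within 1% of optimal
        timeLimit=600,  # optional wall-clock cap in seconds
        msg=True,       # show the CBC log
    ))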

    It can prove that at least 14081 polygons need to be chosen, and it found a solution that chooses 14178 polygons (1% optimality gap) with a coverage tolerance of ±3%:

    Landcover Comparison:
    n20: Required=3303.7, Achieved=3204.7, Difference=99.0
    n25: Required=20000.0, Achieved=19401.1, Difference=598.9
    n5: Required=20000.0, Achieved=19400.0, Difference=600.0
    n16: Required=8021.1, Achieved=7781.3, Difference=239.8
    ....