python, algorithm, performance, optimization

How to efficiently perform dynamic programming with complex state dependencies in Python?


I am working on a Python project that involves implementing a dynamic programming (DP) algorithm, but the state dependencies are not straightforward. Here's a simplified version of my problem:

I need to calculate the minimum cost to traverse a 2D grid where each cell has a cost, but the movement rules are unusual:

You can move down, right, or diagonally down-right. Moving diagonally incurs an extra penalty that depends on the sum of the costs of the starting and ending cells. Additionally, the cost to move into a cell may depend on whether the previous move was horizontal, vertical, or diagonal.

For example, if grid[i][j] is the cost of cell (i, j), then the cost to reach (i, j) from (i-1, j-1) (a diagonal move) would be:

dp[i][j] = dp[i-1][j-1] + grid[i][j] + penalty_function(grid[i-1][j-1], grid[i][j])

But from (i-1, j) (vertical), it would simply be:

dp[i][j] = dp[i-1][j] + grid[i][j]

I attempted the following approach:

def min_cost(grid):
    rows, cols = len(grid), len(grid[0])
    dp = [[float('inf')] * cols for _ in range(rows)]
    dp[0][0] = grid[0][0]  # Starting point

    # The first row and column are only reachable by straight moves,
    # so they must be initialised before the main loop.
    for j in range(1, cols):
        dp[0][j] = dp[0][j - 1] + grid[0][j]
    for i in range(1, rows):
        dp[i][0] = dp[i - 1][0] + grid[i][0]

    for i in range(1, rows):
        for j in range(1, cols):
            vertical = dp[i-1][j] + grid[i][j]
            horizontal = dp[i][j-1] + grid[i][j]
            diagonal = dp[i-1][j-1] + grid[i][j] + penalty_function(grid[i-1][j-1], grid[i][j])
            dp[i][j] = min(vertical, horizontal, diagonal)

    return dp[-1][-1]

However, this becomes inefficient for larger grids because the penalty function itself can be computationally expensive, and the solution doesn't scale well when the grid size exceeds 1000x1000.

Is there a way to optimize this DP approach, possibly by memoizing or precomputing parts of the penalty function? Would switching to libraries like NumPy or using parallel processing help in this scenario? Are there Python-specific tricks (e.g., @functools.lru_cache, generators) that I could use to improve performance while keeping the code clean and readable?


Solution

  • DP can proceed in two ways.

    The first is bottom up. That's harder to write, but often more memory-efficient.

    The second is top down. Just write a recursive function, then memoize it.

    If you're struggling with bottom up, just try top down.

    In your case, though, a 1000x1000 grid means a million DP values, accessed in irregular patterns, so bottom up makes more sense. And rather than the usual space-saving trick of keeping only the previous rows, I would suggest keeping the previous anti-diagonals: at its widest point a diagonal is the same length as a row, but usually it is shorter, which improves cache usage.
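For the top-down route, a minimal sketch might look like the following. Since you haven't shown your penalty function, it is passed in as an argument here (the one in the test is just a placeholder). Note that the recursion depth grows with rows + cols, so for large grids you would need to raise Python's recursion limit with sys.setrecursionlimit, which is part of why bottom up suits your grid size better:

```python
from functools import lru_cache

def min_cost_topdown(grid, penalty_function):
    """Top-down DP: recurse from the target back toward (0, 0), memoizing each cell."""
    rows, cols = len(grid), len(grid[0])
    INF = float('inf')

    @lru_cache(maxsize=None)
    def best(i, j):
        if i < 0 or j < 0:
            return INF          # off the grid
        if i == 0 and j == 0:
            return grid[0][0]   # starting cell
        vertical = best(i - 1, j) + grid[i][j]
        horizontal = best(i, j - 1) + grid[i][j]
        if i > 0 and j > 0:
            diagonal = (best(i - 1, j - 1) + grid[i][j]
                        + penalty_function(grid[i - 1][j - 1], grid[i][j]))
        else:
            diagonal = INF
        return min(vertical, horizontal, diagonal)

    return best(rows - 1, cols - 1)
```

If the penalty depends only on the two cell values and those values repeat a lot, wrapping penalty_function itself in lru_cache is the cheapest way to amortize its cost.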
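The anti-diagonal sweep can be sketched like this: process cells in increasing order of d = i + j, keeping only the two most recent diagonals (a cell's vertical and horizontal predecessors sit on diagonal d - 1, its diagonal predecessor on d - 2). Again, penalty_function is passed in as an argument because your actual penalty isn't shown:

```python
def min_cost_diagonals(grid, penalty_function):
    """Bottom-up DP swept by anti-diagonals (all cells with i + j == d).

    Only the two previous diagonals are kept, so extra memory is
    O(min(rows, cols)) instead of O(rows * cols)."""
    rows, cols = len(grid), len(grid[0])
    INF = float('inf')
    if rows == 1 and cols == 1:
        return grid[0][0]

    prev2 = {}               # dp values on diagonal d - 2, keyed by row index i
    prev1 = {0: grid[0][0]}  # diagonal d - 1, initially just the corner cell

    for d in range(1, rows + cols - 1):
        curr = {}
        # Row indices of the valid cells on diagonal d.
        for i in range(max(0, d - cols + 1), min(rows, d + 1)):
            j = d - i
            best = INF
            if i > 0:            # vertical move, from (i-1, j) on diagonal d-1
                best = min(best, prev1[i - 1] + grid[i][j])
            if j > 0:            # horizontal move, from (i, j-1) on diagonal d-1
                best = min(best, prev1[i] + grid[i][j])
            if i > 0 and j > 0:  # diagonal move, from (i-1, j-1) on diagonal d-2
                best = min(best, prev2[i - 1] + grid[i][j]
                           + penalty_function(grid[i - 1][j - 1], grid[i][j]))
            curr[i] = best
        prev2, prev1 = prev1, curr

    return prev1[rows - 1]   # the last diagonal holds only the target cell
```

Each diagonal is a small, contiguous working set, which is what gives the cache benefit over touching a full million-entry table; it also vectorizes naturally with NumPy if the penalty can be expressed elementwise.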