Tags: python, sympy, symbolic-math, differential-equations, autograd

Recursive symbolic calculations - improve the performance


In my research I'm trying to tackle the Kolmogorov backward equation, i.e. I'm interested in the operator $$Af = b(x)f'(x)+\sigma(x)f''(x)$$

With specific $b(x)$ and $\sigma(x)$, I'm trying to see how fast the coefficients of the expression grow when computing higher powers $A^n f$. I'm struggling to derive this analytically, so I tried to observe the trend empirically.
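
For concreteness, here is a minimal sketch (separate from the benchmark code below) of what I mean by these coefficients: apply A twice to an undefined f and read off the factor in front of each derivative of f.

from sympy import Symbol, Function, Derivative, expand

x = Symbol('x')
f = Function('f')(x)
b = Function('b')(x)
g = Function('g')(x)  # plays the role of sigma

A = lambda h: b*h.diff(x) + g*h.diff(x, 2)   # the operator above
A2f = expand(A(A(f)))
print(A2f.coeff(Derivative(f, x, 2)))        # factor multiplying f''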

First, I used SymPy:

from sympy import *
import matplotlib.pyplot as plt
import re
import math
import numpy as np
import time
np.set_printoptions(suppress=True)

x = Symbol('x')
b = Function('b')(x)
g = Function('g')(x)

def new_coef(gamma, beta, coef_minus2, coef_minus1, coef):
    return expand(simplify(gamma*coef_minus2 + beta*coef_minus1 + 2*gamma*coef_minus1.diff(x)\
            +beta*coef.diff(x)+gamma*coef.diff(x,2)))
def new_coef_first(gamma, beta, coef):
    return expand(simplify(beta*coef.diff(x)+gamma*coef.diff(x,2)))
def new_coef_second(gamma, beta, coef_minus1, coef):
    return expand(simplify(beta*coef_minus1 + 2*gamma*coef_minus1.diff(x)\
            +beta*coef.diff(x)+gamma*coef.diff(x,2)))
def new_coef_last(gamma, beta, coef_minus2):
    return expand(simplify(gamma*coef_minus2))
def new_coef_second_to_last(gamma, beta, coef_minus2, coef_minus1):
    return expand(simplify(gamma*coef_minus2 + beta*coef_minus1 + 2*gamma*coef_minus1.diff(x)))

def set_to_zero(expression):
    expression = expression.subs(Derivative(b, x, x, x), 0)
    expression = expression.subs(Derivative(b, x, x), 0)
    expression = expression.subs(Derivative(g, x, x, x, x), 0)
    expression = expression.subs(Derivative(g, x, x, x), 0)
    return expression

def sum_of_coef(expression):
    sum_of_coef = 0
    for i in str(expression).split(' + '):
        if i[0:1] == '(':
            i = i[1:]
        integers = re.findall(r'\b\d+\b', i)
        if len(integers) > 0:
            length_int = len(integers[0])
            if i[0:length_int] == integers[0]:
                sum_of_coef += int(integers[0])
            else:
                sum_of_coef += 1
        else:
            sum_of_coef += 1
    return sum_of_coef
power = 6
charar = np.zeros((power, power*2), dtype=Symbol)
coef_sum_array = np.zeros((power, power*2))
charar[0,0] = b
charar[0,1] = g
coef_sum_array[0,0] = 1
coef_sum_array[0,1] = 1
for i in range(1, power):
    #print(i)
    for j in range(0, (i+1)*2):
        #print(j, ':')
        #start_time = time.time()
        if j == 0:
            charar[i,j] = set_to_zero(new_coef_first(g, b, charar[i-1, j]))
        elif j == 1:
            charar[i,j] = set_to_zero(new_coef_second(g, b, charar[i-1, j-1], charar[i-1, j]))
        elif j == (i+1)*2-2:
            charar[i,j] = set_to_zero(new_coef_second_to_last(g, b, charar[i-1, j-2], charar[i-1, j-1]))
        elif j == (i+1)*2-1:
            charar[i,j] = set_to_zero(new_coef_last(g, b, charar[i-1, j-2]))
        else:
            charar[i,j] = set_to_zero(new_coef(g, b, charar[i-1, j-2], charar[i-1, j-1], charar[i-1, j]))
        #print("--- %s seconds for expression---" % (time.time() - start_time))
        #start_time = time.time()
        coef_sum_array[i,j] = sum_of_coef(charar[i,j])
        #print("--- %s seconds for coeffiecients---" % (time.time() - start_time))
coef_sum_array

Then I looked into automatic differentiation and used autograd:

import autograd.numpy as np
from autograd import grad
import time
np.set_printoptions(suppress=True)

b = lambda x: 1 + x
g = lambda x: 1 + x + x**2

def new_coef(gamma, beta, coef_minus2, coef_minus1, coef):
    return lambda x: gamma(x)*coef_minus2(x) + beta(x)*coef_minus1(x) + 2*gamma(x)*grad(coef_minus1)(x)\
            +beta(x)*grad(coef)(x)+gamma(x)*grad(grad(coef))(x)
def new_coef_first(gamma, beta, coef):
    return lambda x: beta(x)*grad(coef)(x)+gamma(x)*grad(grad(coef))(x)
def new_coef_second(gamma, beta, coef_minus1, coef):
    return lambda x: beta(x)*coef_minus1(x) + 2*gamma(x)*grad(coef_minus1)(x)\
            +beta(x)*grad(coef)(x)+gamma(x)*grad(grad(coef))(x)
def new_coef_last(gamma, beta, coef_minus2):
    return lambda x: gamma(x)*coef_minus2(x)
def new_coef_second_to_last(gamma, beta, coef_minus2, coef_minus1):
    return lambda x: gamma(x)*coef_minus2(x) + beta(x)*coef_minus1(x) + 2*gamma(x)*grad(coef_minus1)(x)

power = 6
coef_sum_array = np.zeros((power, power*2))
coef_sum_array[0,0] = b(1.0)
coef_sum_array[0,1] = g(1.0)
charar = [b, g]
for i in range(1, power):
    print(i)
    charar_new = []
    for j in range(0, (i+1)*2):
        if j == 0:
            new_funct = new_coef_first(g, b, charar[j])
        elif j == 1:
            new_funct = new_coef_second(g, b, charar[j-1], charar[j])
        elif j == (i+1)*2-2:
            new_funct = new_coef_second_to_last(g, b, charar[j-2], charar[j-1])
        elif j == (i+1)*2-1:
            new_funct = new_coef_last(g, b, charar[j-2])
        else:
            new_funct = new_coef(g, b, charar[j-2], charar[j-1], charar[j])
        coef_sum_array[i,j] = new_funct(1.0)
        charar_new.append(new_funct)
    charar = charar_new
coef_sum_array

However, I'm not happy with the speed of either of them. I would like to do at least a thousand iterations, but after 3 days of running the SymPy method I only got to 30 :/

I expect that the second (numerical) method could be optimized to avoid recalculating expressions every time, but I cannot see how to do that myself. I have also tried Maple, but again without luck.


Solution

  • Overview

    So, there are two formulas about derivatives that are interesting here:

    1. Faa di Bruno's formula, which is a way to quickly find the n-th derivative of f(g(x)), and looks a lot like the Multinomial theorem
    2. The General Leibniz rule, which is a way to quickly find the n-th derivative of f(x)*g(x), and looks a lot like the Binomial theorem

    Both of these have been discussed in pull request #13892, where the n-th derivative was sped up using the general Leibniz rule.

    "I'm trying to see how fast the coefficients of the expression are growing"

    In your code, the general formula for computing c[i][j] is this:

    c[i][j] = g * c[i-1][j-2] + b * c[i-1][j-1] + 2 * g * c'[i-1][j-1] + b * c'[i-1][j] + g * c''[i-1][j]

    (where by c'[i][j] and c''[i][j] are the 1st and 2nd derivatives of c[i][j])

    Because of this, and because of the Leibniz rule mentioned above, my intuition is that the computed coefficients should be related to Pascal's triangle (or at the very least have some combinatorial structure).
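
    As a quick sanity check (a small SymPy sketch, not part of the original code), the general Leibniz rule, which is where the binomial coefficients come from, is easy to verify directly:

    from sympy import symbols, Function, binomial, simplify

    x = symbols('x')
    f = Function('f')(x)
    g = Function('g')(x)

    n = 4  # derivative order to check
    lhs = (f*g).diff(x, n)
    rhs = sum(binomial(n, k) * f.diff(x, k) * g.diff(x, n - k) for k in range(n + 1))
    print(simplify(lhs - rhs))  # prints 0: the binomial-weighted sum matches the n-th derivative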

    Optimization #1

    In the original code, the function sum_of_coef(f) serializes the expression f to a string, discards everything that doesn't look like a number, and then sums the remaining numbers.

    We can avoid serialization here by just traversing the expression tree and collecting what we need:

    def sum_of_coef(f):
        s = 0
        if f.func == Add:
            # walk the top-level sum and add each term's leading numeric factor
            for sum_term in f.args:
                res = sum_term if sum_term.is_Number else 1
                if len(sum_term.args) == 0:
                    # atomic term: a bare number or symbol
                    s += res
                    continue
                first = sum_term.args[0]
                if first.is_Number == True:
                    res = first
                else:
                    res = 1
                s += res
        elif f.func == Mul:
            # a single product: take its leading numeric factor if there is one
            first = f.args[0]
            if first.is_Number == True:
                s = first
            else:
                s = 1
        elif f.func == Pow:
            s = 1
        return s
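
    For example, on a small hand-made expression (using the b, g and x defined in the question), the traversal version just sums the leading numeric coefficients of the top-level terms:

    expr = expand(2*b*g.diff(x) + 3*b**2 + g)
    print(sum_of_coef(expr))   # 2 + 3 + 1 = 6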
    

    Optimization #2

    In the function set_to_zero(expr) all the 2nd and 3rd derivatives of b, and the 3rd and 4th derivatives of g are replaced by zero.

    We can collapse all those substitutions into one statement like so:

    b3,b2=b.diff(x,3),b.diff(x,2)
    g4,g3=g.diff(x,4),g.diff(x,3)
    
    def set_to_zero(expression):
        expression = expression.subs({b3:0,b2:0,g4:0,g3:0})
        return expression
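
    A quick check (a small sketch using the b, g and x defined earlier) that the collapsed substitution removes exactly the derivatives it is supposed to:

    expr = b.diff(x, 2) + g.diff(x, 3) + b*g.diff(x)
    print(set_to_zero(expr))   # higher derivatives vanish, leaving b(x)*Derivative(g(x), x)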
    

    Optimization #3

    In the original code, for every cell c[i][j] we're calling simplify. This turns out to have a big impact on performance but actually we can skip this call, because fortunately our expressions are just sums of products of derivatives or unknown functions.

    So the line

    charar[i,j] = set_to_zero(expand(simplify(expr)))
    

    becomes

    charar[i,j] = set_to_zero(expand(expr))
    

    Optimization #4

    The following was also tried but turned out to have very little impact.

    For two consecutive values of j, we're computing c'[i-1][j-1] twice.

    j-1       c[i-1][j-3] c[i-1][j-2] c[i-1][j-1]
      j                   c[i-1][j-2] c[i-1][j-1] c[i-1][j]
    

    If you look at the loop formula in the else branch, you see that c'[i-1][j-1] has already been computed. It can be cached, but this optimization has little effect in the SymPy version of the code.
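
    A minimal sketch of that caching idea (a hypothetical helper applied to the SymPy loop from the question): keep the first derivatives of row i-1 in a dictionary so that each one is differentiated only once per row.

    diff_cache = {}   # reset at the start of every row i

    def d1(row, col):
        # first derivative of charar[row, col], computed at most once per row
        if col not in diff_cache:
            diff_cache[col] = charar[row, col].diff(x)
        return diff_cache[col]

    In the helpers from the question, the coef_minus1.diff(x) and coef.diff(x) calls would then be served from this cache (e.g. d1(i-1, j-1) and d1(i-1, j) in the main loop).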

    Here it's also important to mention that it's possible to visualize the call tree of SymPy involved in computing these derivatives. It's actually larger, but here is part of it:

    [Image: partial SymPy call tree for the n-th derivative computation]

    We can also generate a flamegraph using the py-spy module just to see where time is being spent:

    [Image: py-spy flamegraph of the SymPy run]

    As far as I could tell, about 34% of the time is spent in _eval_derivative_n_times, 10% in the function getit from assumptions.py, 12% in subs(..), and 12% in expand(..).

    Optimization #5

    Apparently when pull request #13892 was merged into SymPy, it also introduced a performance regression.

    In one of the comments regarding that regression, Ondrej Certik recommends using SymEngine to improve performance of code that makes heavy use of derivatives.

    So I've ported the code mentioned above to SymEngine.py and noticed that it runs 98 times faster than the SymPy version for power=8 (and 4320 times faster for power=30).

    The required module can be installed via pip3 install --user symengine.

    #!/usr/bin/python3
    from symengine import *
    import pprint
    x=var("x")
    b=Function("b")(x)
    g=Function("g")(x)
    
    b3,b2=b.diff(x,3),b.diff(x,2)
    g4,g3=g.diff(x,4),g.diff(x,3)
    
    def set_to_zero(e):
        e = e.subs({b3:0,b2:0,g4:0,g3:0})
        return e
    
    def sum_of_coef(f):
        s = 0
        if f.func == Add:
            for sum_term in f.args:
                res = 1
                if len(sum_term.args) == 0:
                    s += res
                    continue
                first = sum_term.args[0]
                if first.is_Number == True:
                    res = first
                else:
                    res = 1
                s += res
        elif f.func == Mul:
            first = f.args[0]
            if first.is_Number == True:
                s = first
            else:
                s = 1
        elif f.func == Pow:
            s = 1
        return s
    
    def main():
        power = 8
        charar = [[0] * (power*2) for _ in range(power)]
        coef_sum_array = [[0] * (power*2) for _ in range(power)]
        charar[0][0] = b
        charar[0][1] = g
        init_printing()
        for i in range(1, power):
            jmax = (i+1)*2
            for j in range(0, jmax):
                c2,c1,c0 = charar[i-1][j-2],charar[i-1][j-1],charar[i-1][j]
                #print(c2,c1,c0)
                if   j == 0:
                    expr =                                b*c0.diff(x) + g*c0.diff(x,2)
                elif j == 1:
                    expr =        b*c1 + 2*g*c1.diff(x) + b*c0.diff(x) + g*c0.diff(x,2)
                elif j == jmax-2:
                    expr = g*c2 + b*c1 + 2*g*c1.diff(x)
                elif j == jmax-1:
                    expr = g*c2
                else:
                    expr = g*c2 + b*c1 + 2*g*c1.diff(x) + b*c0.diff(x) + g*c0.diff(x,2)
                charar[i][j] = set_to_zero(expand(expr))
                coef_sum_array[i][j] = sum_of_coef(charar[i][j])
    
        pprint.pprint(Matrix(coef_sum_array))
    
    main()
    

    Performance after optimization #5

    I think it would be very interesting to look at the number of terms in c[i][j] to determine how quickly the expressions are growing. That would definitely help in estimating the complexity of the current code.

    But for practical purposes I've plotted the current time and memory consumption of the SymEngine code above and managed to get the following chart:

    performance_modif6

    Both the time and the memory seem to be growing polynomially with the input (the power parameter in the original code).

    The same chart but as a log-log plot can be viewed here:

    performance_modif6_loglog

    Like the wiki page says, a straight line on a log-log plot corresponds to a monomial. This offers a way to recover the exponent of the monomial.
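
    Concretely, if time(N) = C * N^m, then log(time) = log(C) + m*log(N), so the slope between two sampled points is m = log(y1/y0) / log(x1/x0); that is exactly what the snippet below computes.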

    So if we consider two points, N=16 and N=32, between which the log-log plot looks like a straight line:

    import pandas as pd
    from math import log   # used below to compute the slope

    df=pd.read_csv("modif6_bench.txt", sep=',',header=0)
    
    def find_slope(col1,col2,i1,i2):
        xData = df[col1].to_numpy()
        yData = df[col2].to_numpy()
        
        x0,x1 = xData[i1],xData[i2]
        y0,y1 = yData[i1],yData[i2]
        
        m = log(y1/y0)/log(x1/x0)
        return m
    
    print("time slope = {0:0.2f}".format(find_slope("N","time",16,32)))
    print("memory slope = {0:0.2f}".format(find_slope("N","memory",16,32)))
    

    Output:

    time slope = 5.69
    memory slope = 2.62
    

    So a very rough approximation of the time complexity would be O(n^5.69), and an approximation of the space complexity would be O(n^2.62).

    There are more details about deciding whether the growth rate is polynomial or exponential here (it involves drawing a semi-log and a log-log plot, and seeing where the data shows up as a straight line).

    Performance with defined b and g functions

    In the first original code block, the functions b and g were undefined functions. This means SymPy and SymEngine didn't know anything about them.

    The 2nd original code block defines b=1+x and g=1+x+x**2. If we run all of this again with known b and g, the code runs much faster, and the running time and memory usage curves are better than with unknown functions:

    [Plots: running time and memory usage vs N for known b and g]

    time slope = 2.95
    memory slope = 1.35
    

    Recorded data fitting onto known growth-rates

    I wanted to look a bit more into matching the observed resource consumption (time and memory), so I wrote the following Python module, which fits each growth rate from a catalog of common growth rates to the recorded data and then shows the plot to the user.

    It can be installed via pip3 install --user matchgrowth

    When run like this:

    match-growth.py --infile ./tests/modif7_bench.txt --outfile time.png --col1 N --col2 time --top 1
    

    It produces graphs of the resource usage, as well as the closest growth rates they match. In this case, it finds polynomial growth to be the closest:

    [Plot: recorded time data with the closest matching growth rates (polynomial)]

    Other notes

    If you run this for power=8 (in the symengine code mentioned above) the coefficients will look like this:

    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    [1, 5, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    [1, 17, 40, 31, 9, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    [1, 53, 292, 487, 330, 106, 16, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    [1, 161, 1912, 6091, 7677, 4693, 1520, 270, 25, 1, 0, 0, 0, 0, 0, 0]
    [1, 485, 11956, 68719, 147522, 150706, 83088, 26573, 5075, 575, 36, 1, 0, 0, 0, 0]
    [1, 1457, 73192, 735499, 2568381, 4118677, 3528928, 1772038, 550620, 108948, 13776, 1085, 49, 1, 0, 0]
    [1, 4373, 443524, 7649215, 42276402, 102638002, 130209104, 96143469, 44255170, 13270378, 2658264, 358890, 32340, 1876, 64, 1]
    

    So as it turns out, the 2nd column coincides with A048473, which according to OEIS is "The number of triangles (of all sizes, including holes) in Sierpiński's triangle after n inscriptions".
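
    If that identification is right, the column should follow the closed form 2*3^n - 1, which indeed matches the values printed above (a quick, purely illustrative check):

    second_column = [5, 17, 53, 161, 485, 1457, 4373]   # rows 1..7 of the matrix above
    print(all(v == 2*3**(n + 1) - 1 for n, v in enumerate(second_column)))   # True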

    All the code for this is also available in this repo.