Tags: python, python-3.x, algorithm, performance, pi

How to accelerate the convergence of Leibniz Series?


The Leibniz series is the infinite series 1 - 1/3 + 1/5 - 1/7 + ... = π/4, which can be used to approximate π. It does indeed converge, but it does so in a notoriously slow way.

A few years ago I wrote many Python functions to calculate π, and of course I wrote one using the Leibniz series (because it appears in many online tutorials). Finding it inefficient, I wrote much more efficient functions that approximate π much faster; the fastest I have tested was Ramanujan's π formula, which gives 15 correct decimal places in 3 iterations (at which point the double underflows).

A while back I watched this video about accelerating the Leibniz series by using correction terms, and I was intrigued, so I wrote some Python functions for the correction terms to check the video's claims. I found that with 1 correction term the series yields 10 correct decimal places after 2048 iterations, with 2 correction terms it yields 15 correct decimal places after 638 iterations, and with 3 correction terms it yields 15 correct decimal places in 131 iterations, but 4 correction terms make the series converge more slowly, causing the underflow to occur at the 154th iteration.

I read that the Leibniz series can be accelerated:

However, the Leibniz formula can be used to calculate π to high precision (hundreds of digits or more) using various convergence acceleration techniques.

So I wanted to implement the algorithms myself, as a self-imposed programming challenge to improve my skills (if anyone's wondering, no, this is not homework), but my math isn't that good, so I put off doing it until very recently, when I read this.

I tried to implement some algorithms, but I found that while they need fewer terms for a more precise result, the computation for the same number of terms is much more expensive, so they end up taking longer to execute. Below is my code. I want to know what programming techniques and convergence acceleration methods (limited to ones found on the Leibniz series Wikipedia page) I can use to speed up the calculation of π using the Leibniz series, both making it use fewer terms and making it take less time to execute.

(Using compiled libraries is OK, but I don't want to cheat, e.g. by caching.)

import math
from itertools import cycle
from math import factorial


def Ramanujan(n):
    def term(k): return factorial(4 * k) * (1103 + 26390 * k) / \
        (factorial(k) ** 4 * 396 ** (4 * k))
    s = sum(term(i) for i in range(n))
    return 1 / (2 * 2 ** 0.5 * s / 9801)


sign = [1, -1]


def Leibniz_Series(n):
    return [1/i*s for i, s in zip(range(1, 2*n+1, 2), cycle(sign))]


def Leibniz_0(n):
    return 4 * sum(Leibniz_Series(n))


def Leibniz_1(n):
    correction = 1 / n*sign[n % 2]
    return Leibniz_0(n) + correction


def Leibniz_2(n):
    correction = 1 / (
        n + 1 / (
            4 * n
        )
    )*sign[n % 2]
    return Leibniz_0(n) + correction


def Leibniz_3(n):
    correction = 1 / (
        n + 1 / (
            4 * n + 4 / n
        )
    )*sign[n % 2]
    return Leibniz_0(n) + correction


def Leibniz_4(n):
    correction = 1 / (
        n + 1 / (
            4 * n + 4 / (
                n + 9 / n
            )
        )
    )*sign[n % 2]
    return Leibniz_0(n) + correction


def bincoeff(n, k):
    r = 1
    if (k > n):
        return 0
    for d in range(1, k + 1):
        r = r * n / d
        n -= 1
    return r


def abs(n):
    return n * (-1) ** (n < 0)


def Euler(arr):
    out = []

    for i in range(len(arr)):
        delta = 0
        for j in range(i + 1):
            coeff = bincoeff(i, j)
            delta += (-1) ** j * coeff * abs(arr[j])

        out.append(0.5 ** (i + 1) * delta)
    return out


def Leibniz_Euler(n):
    return 4 * sum(Euler(Leibniz_Series(n)))


def Leibniz_Aitken(n):
    series = Leibniz_Series(n)
    s0, s1, s2 = (sum(series[:n-i]) for i in (2, 1, 0))
    return 4 * (s2 * s0 - s1 * s1) / (s2 - 2*s1 + s0)


def LCP(s1, s2):
    i = 0
    for a, b in zip(s1, s2):
        if a != b:
            break
        i += 1
    return i


for func in (Ramanujan, Leibniz_Euler, Leibniz_Aitken, Leibniz_4, Leibniz_3, Leibniz_2, Leibniz_1, Leibniz_0):
    i = 12
    last = 0
    while (pi := func(i)) != math.pi:
        i += 1
        if i == 2048:
            if abs(pi - math.pi) <= 1e-6:
                print('the calculated result is close')
            else:
                print('the method is too slow')
            break
        if pi == last:
            print('float underflow')
            break
        last = pi
    print(
        f'function: {func.__name__}, iterations: {i}, correctness: {LCP(str(pi), str(math.pi))}')

Performance:

function: Ramanujan, iterations: 12, correctness: 17
float underflow
function: Leibniz_Euler, iterations: 52, correctness: 16
the calculated result is close
function: Leibniz_Aitken, iterations: 2048, correctness: 11
function: Leibniz_4, iterations: 154, correctness: 17
function: Leibniz_3, iterations: 131, correctness: 17
function: Leibniz_2, iterations: 638, correctness: 17
the calculated result is close
function: Leibniz_1, iterations: 2048, correctness: 12
the method is too slow
function: Leibniz_0, iterations: 2048, correctness: 4

In [2]: %timeit Ramanujan(3)
3.7 µs ± 229 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [3]: %timeit Leibniz_3(131)
17.4 µs ± 123 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [4]: %timeit Leibniz_Aitken(2048)
295 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [5]: %timeit Leibniz_Euler(52)
3.81 ms ± 430 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Solution

  • Reducing the number of terms

    You can compress the series by computing terms two by two. Indeed, 1/a - 1/(a+2) = 2/(a(a+2)). Since 2 is a constant here, you can compute the sum of the terms as 2 * (1/(1*3) + 1/(5*7) + 1/(9*11) + ...), as sketched below. Each compressed term is a bit more expensive to compute, but the slowest operation is actually the division, and there are half as many divisions for a similar target precision. If we try to apply the same method to compress the series further (by packing terms four by four), the terms start to be significantly more complex to compute, so the benefit is not so big. Indeed, for each group of 4 terms, we can compute b, c, d = a+2, a+4, a+6 and then compute the compressed term 2 (a b + c d) / (a b c d). The constant 2 can still be factored out when computing multiple terms. With this last method, there are 5 mul + 1 add + 1 div per term, while the previous method required 2 mul + 1 add + 2 div per pair of terms (similar precision). Thus, it is not worth it unless the cost of a division is huge compared to a multiplication (i.e. cost(1 div) > cost(3 mul)). This is actually the case on mainstream x86-64 processors (e.g. 4 cycles/div versus 0.5 cycle/mul for scalar native computations on relatively recent Intel processors). Compressing the series further is just not worth it on any mainstream architecture in practice, not to mention it makes the terms far more complex.
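
    To make the pairwise compression concrete, here is a minimal pure-Python sketch (the function name leibniz_pairs is just illustrative):

    def leibniz_pairs(n_pairs):
        # 1/a - 1/(a+2) == 2/(a*(a+2)), with a = 1, 5, 9, ...
        s = 0.0
        for a in range(1, 4 * n_pairs, 4):
            s += 1.0 / (a * (a + 2))
        return 8.0 * s   # 4 (for pi) * 2 (factored-out constant) * s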

    The same logic holds for Leibniz_Euler and Leibniz_Aitken. You can strongly reduce the number of terms to compute, but generally at the expense of more expensive terms (unless the target series is trivial). This method can be generalized to any algorithm, or even to processors and programming languages. More specifically, it can be applied to Turing machines (cf. the linear speed-up theorem). Again, in that case it requires an exponential number of states and/or symbols to compress a given Turing machine (unless it can be simplified, i.e. unless it is a trivial one). There is no free lunch.

    However, note that Euler has a sub-optimal time complexity: it runs in O(n**3) time while it can be computed in O(n**2) time. Indeed, the bincoeff values can be precomputed using Pascal's triangle (see the sketch below). This should significantly speed up the computation, so this method can be about as fast as (or a bit faster than) the Leibniz_Aitken one.
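
    A possible O(n**2) version of the Euler transform (a sketch; it builds each row of Pascal's triangle incrementally instead of calling bincoeff for every pair (i, j)):

    def euler_transform_fast(arr):
        out = []
        row = [1]                              # row i of Pascal's triangle
        for i in range(len(arr)):
            delta = 0.0
            for j in range(i + 1):
                delta += (-1) ** j * row[j] * abs(arr[j])
            out.append(0.5 ** (i + 1) * delta)
            # Next row: C(i+1, j) = C(i, j-1) + C(i, j)
            row = [1] + [row[j - 1] + row[j] for j in range(1, i + 1)] + [1]
        return out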


  • Computing terms much faster

    CPython is an interpreter, and interpreters are very slow for this kind of computation. You should avoid CPython if you want things to be computed quickly. There are many ways to do that. The most famous one is vectorization, that is, calling native (C/C++) functions that do most of the work without the interpreter; NumPy is a good example of that. Another approach is to use compilers to generate native functions from Python code; Numba and Cython are examples of modules doing this very well, and the latter is generally more efficient than the former (especially if you are a skilled developer).
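
    For example, a minimal NumPy sketch of the vectorized basic series (assuming NumPy is installed) pushes the whole summation into native code:

    import numpy as np

    def leibniz_numpy(n):
        k = np.arange(n)
        signs = np.where(k % 2 == 0, 1.0, -1.0)   # +1, -1, +1, ...
        return 4.0 * np.sum(signs / (2 * k + 1))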

    Based on the compressed series described above, we can write a very fast compiled implementation (compared to the reference implementation):

    import numba as nb

    @nb.njit('float64(int64)')
    def Leibniz_fast(n):
        iMax = 2*n+1
    
        # Compressed terms
        s1 = 0.0
        iCompMax = (2*n+1-8)//8*8+1
        i = iCompMax
        while i >= 1:
            a = float(i)
            b = a + 2
            c = a + 4
            d = a + 6
            s1 += (a*b + c*d) / ((a*b)*(c*d))
            i -= 8
    
        # Remaining non-compressed terms
        s2 = 0.0
        if iCompMax +  8 < iMax:  s2 += 1.0 / (iCompMax +  8)
        if iCompMax + 10 < iMax:  s2 -= 1.0 / (iCompMax + 10)
        if iCompMax + 12 < iMax:  s2 += 1.0 / (iCompMax + 12)
    
        # Correction
        sign = 1 if n % 2 == 0 else -1
        correction = sign / (n + 1 / (4 * n + 4 / n))   # From Leibniz_3
    
        return 8 * s1 + 4 * s2 + correction
    

  • Performance benchmark

    Here are performance results on my i5-9600KF processor with CPython 3.8.1.

    Leibniz_3(131):      12.6 µs
    Leibniz_fast(131):    0.2 µs
    
    Leibniz_3(500):      47.6 µs
    Leibniz_fast(500):    0.3 µs
    

    Leibniz_fast is about 100 times faster. Most of the speed-up comes from the compilation of the code.


  • Note about the accuracy of the solutions

    The precision of Leibniz_fast is close to that of Leibniz_3. Both achieve an accuracy of a few ULPs for small n values. For large n values, both are quite inaccurate because the sum accumulates rounding errors. Note that Leibniz_fast computes the compressed terms in reverse order so as to accumulate small values together and add values of the same order of magnitude. That being said, this is not enough to get a very accurate result for large n values. One needs to perform a more accurate summation, for example using chunks or even a Kahan summation, at the expense of a slower computation (see the sketch below). An alternative solution is to use libraries like GMP (faster than the default decimal package) to compute numbers with arbitrary precision, though that is much slower than operating on double-precision numbers.
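
    As an illustration, a compensated (Kahan) summation of the compressed pairwise terms could look like this sketch (pure Python for clarity; the same pattern works inside the compiled function):

    def leibniz_pairs_kahan(n_pairs):
        s = 0.0
        c = 0.0                                # running compensation for lost low-order bits
        for a in range(1, 4 * n_pairs, 4):
            term = 1.0 / (a * (a + 2))
            y = term - c
            t = s + y
            c = (t - s) - y
            s = t
        return 8.0 * s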

    The number of terms required for Leibniz_fast to reach the same accuracy as Leibniz_3 is 4 times smaller (fewer than 40 terms are needed for double-precision numbers). Indeed, terms are computed 4 at a time with a step of 8 in the while loop.

    If you want to compute π precisely, then the Ramanujan–Sato series (mentioned in the question) is a very good one when combined with libraries like GMP. The result is both very fast and precise. Note that the factorial terms can be precomputed (or updated incrementally) to improve the complexity compared to your naive implementation. Besides, some factorial factors can be cancelled so as to keep the numbers smaller in memory (this is critical since the numbers grow exponentially and the amount of RAM used can quickly become an issue).
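
    For instance, here is a sketch using the standard decimal module (dependency-free, though slower than GMP-based libraries such as gmpy2) that updates the factorial-based factor incrementally instead of recomputing factorials from scratch; the name ramanujan_pi is illustrative:

    from decimal import Decimal, getcontext

    def ramanujan_pi(n_terms, digits=50):
        getcontext().prec = digits + 10        # extra guard digits
        total = Decimal(0)
        factor = Decimal(1)                    # (4k)! / (k!**4 * 396**(4k)) for k = 0
        for k in range(n_terms):
            total += factor * (1103 + 26390 * k)
            # Incremental update of the factor from k to k+1
            factor *= (4*k + 1) * (4*k + 2) * (4*k + 3) * (4*k + 4)
            factor /= Decimal(k + 1) ** 4 * Decimal(396) ** 4
        pi = 9801 / (2 * Decimal(2).sqrt() * total)
        getcontext().prec = digits
        return +pi                             # unary plus rounds to the target precision

    print(ramanujan_pi(7))                     # roughly 8 more digits per extra term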