python numpy scipy

Which function calculates the sum of squared residuals?


I want to calculate the residual sum of squares (RSS) from a given data set and a given fit function, but I can't find a built-in function that does this.

import numpy as np
import pandas as pd

df3 = pd.DataFrame({'x': [1, 1, 1], 'y': [2, 3, 2]})

res = np.sum(np.square(df3['x'] - df3['y']))  # what function is equivalent to this?
print(res)

All solutions I found online either implement this by hand or take the value from the fitting process. Implementing the function manually is an option, but not my preference: I would like to keep things simple and avoid reinventing the wheel. I assume either NumPy or SciPy has such a function, but I can't find it. Apologies if this has already been answered elsewhere, and thank you in advance.


Solution

  • I didn't intend to answer the question, but I ran some timings and found the results interesting:

    >>> import numpy as np
    >>> a = np.array([1., 2., 3.])
    >>> b = np.array([2., 0., 3.])
    
    >>> np.sum(np.square(a - b))
    5.0
    >>> %timeit np.sum(np.square(a - b))
    2.93 µs ± 51.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
    
    >>> np.linalg.norm(a - b)**2
    5.000000000000001
    >>> %timeit np.linalg.norm(a - b)**2
    2.11 µs ± 14.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
    
    >>> ((a-b)**2).sum()
    5.0
    >>> %timeit ((a-b)**2).sum()
    1.62 µs ± 13.6 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
    
    >>> c = a - b
    >>> c @ c
    5.0
    >>> %timeit c = a - b; c @ c
    1.06 µs ± 6.25 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
    

    Edit: better performance comparison.

    import numpy as np
    import perfplot
    
    def functions(a, b):
        return np.sum(np.square(a - b))
    
    def norm(a, b):
        return np.linalg.norm(a - b) ** 2
    
    def methods(a, b):
        return ((a - b) ** 2).sum()
    
    def matmul(a, b):
        c = a - b
        return c @ c
    
    perfplot.show(
        setup=lambda n: (np.random.random(n), np.random.random(n)),
        kernels=[functions, norm, methods, matmul],
        n_range=[2 ** k for k in range(22)],
        xlabel="len(a)"
        )
    

    [Figure: performance plot]
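As a minimal sketch, the dot-product variant from the timings above could be wrapped in a small helper. The name `rss` and its argument names are hypothetical (this is not a NumPy or SciPy function), and the helper assumes one-dimensional inputs, since `@` on 2-D arrays would perform a matrix product instead.

```python
import numpy as np


def rss(observed, predicted):
    """Residual sum of squares for 1-D inputs (hypothetical helper, not a library API)."""
    # np.asarray also accepts lists and pandas columns such as df3['y'].
    residuals = np.asarray(observed, dtype=float) - np.asarray(predicted, dtype=float)
    # For a 1-D vector, r @ r is the sum of its squared entries.
    return float(residuals @ residuals)


# Values from the question; treating y as observed and x as fitted is an assumption.
x = [1, 1, 1]
y = [2, 3, 2]
print(rss(y, x))  # residuals are 1, 2, 1

# Arrays from the timings above. The norm-based variant takes a square root and
# squares it again, so it can differ from the others in the last few bits.
a = np.array([1., 2., 3.])
b = np.array([2., 0., 3.])
print(rss(a, b), np.linalg.norm(a - b) ** 2)
```

When comparing the norm-based result with the others, a tolerance check such as `np.isclose` is safer than exact equality.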