Rationale for the sum of an empty vector to return 0


Disclaimer: I am not asking how to bypass this behaviour, but rather would like an explanation as to why this is the default behaviour.

I came across a behaviour of the rowSums() and sum() functions in R whose rationale escapes my understanding.

Let mat be a matrix whose rows have i) no NA, ii) one NA, iii) only NA's, iv) only 0:

> mat <- matrix(c(1:8, rep(NA, 4), rep(0, 3)), ncol=3, byrow=T)
> mat
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8   NA
[4,]   NA   NA   NA
[5,]    0    0    0

Computing the row sums with rowSums() gives:

> rowSums(mat)
[1]  6 15 NA NA  0

Fair enough, if we have any NA in the row, it'll return NA.
Let's tell rowSums() to remove NA's in the calculation:

> rowSums(mat, na.rm=T)
[1]  6 15 15  0  0

We see that a NA-only vector has a sum of zero if na.rm=T, which is consistent with the sum of an empty vector:

> sum(c())
[1] 0

I cannot understand the rationale behind returning zero for the sum of an empty vector (following na.rm=T). Is it an arbitrary decision?

I was expecting the sum of an empty vector to return NA, for two reasons:

  • It would make the distinction between a NA-only vector input, and a 0-only vector input.
    This seems fundamental to me as 0 means "no signal", and NA means "no measure/not possible".

  • Assessment of the argument type:
    The sum() function blindly assumes that an empty vector is somehow of type numeric since it returns zero. Why?


Solution

  • 0 is the only sensible value. First off, because it's the neutral element of addition. Mathematically, this is all the justification we need, really. [1]

    But secondly, your reasoning doesn’t make sense:

    • It would make the distinction between a NA-only vector input, and a 0-only vector input.

    We are talking about an empty vector. It’s neither NA-only nor 0-only. So, right out of the gate, this reason doesn’t apply here.

    • The sum() function blindly assumes that an empty vector is somehow of type numeric since it returns zero.

    The sum() function does no such thing:

    typeof(sum(integer(0L)))
    # [1] "integer"
    

    What you are observing is the behaviour of c(): called with no arguments it returns NULL, which sum() accepts as an empty input. It is not sum() guessing a type.
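
To make the type point concrete, here is a minimal sketch comparing two empty vectors of different types. The variable names are illustrative only and do not come from the question.

```r
# Two empty vectors that differ only in type (names are illustrative)
empty_int <- integer(0)
empty_dbl <- numeric(0)

# sum() returns a zero of the matching type rather than assuming one
typeof(sum(empty_int))  # "integer"
typeof(sum(empty_dbl))  # "double"
```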


    [1] There are two potential objections to this:

    1. Even in mathematics, it’s merely a convention in certain domains. Yes, but this convention is extremely widespread, has no real alternatives, and exists for a good reason: it makes large swathes of mathematics consistent and simple.

    2. Why do we care about mathematics? Again, the reason comes back to the fact that the mathematical convention makes sense, and the consistency it provides carries over into our code. Put differently, the fact that the sum of an empty vector is 0 drastically simplifies most code. If this weren’t the case, we’d have to special-case empty vectors almost anywhere a sum is computed.
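
As a rough illustration of that last point, the sketch below sums a vector in two pieces. The helper sum_in_chunks() is hypothetical and written only for this example. Because the sum of an empty piece is 0, the split points that leave one piece empty need no special handling.

```r
# Hypothetical helper for illustration: sum a vector as two pieces,
# split after position `split_at`
sum_in_chunks <- function(x, split_at) {
  head_part <- x[seq_len(split_at)]
  tail_part <- x[seq_len(length(x) - split_at) + split_at]
  sum(head_part) + sum(tail_part)
}

x <- c(2, 3, 5)

# Every split point gives the same total, including the two that
# leave one piece empty (split_at = 0 and split_at = 3)
sapply(0:length(x), function(k) sum_in_chunks(x, k))
# [1] 10 10 10 10
```

If sum() returned NA for an empty input, the first and last split would yield NA, and the helper would need an explicit length check.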