Search code examples
ralgorithmperformanceprimescomputation

Why am I getting only marginal improvements in time when I cut the computations in half?


I wanted to write a program which will print and count all the primes between 1 and k. The idea is that we take each odd number i between 1 and k and check for odd factors of i up to sqrt(i). It took my computer 1 minute 47 seconds to go check for primes up to k = 1 000 000.

Then I tried to improve my code: instead of checking all odd numbers from 1 to k, I removed all the multiples of 3, 5, and 7 from these odd numbers, so basically I removed about half the numbers we have to check. I ran the program again, and with k = 1 000 000 I got a time of 1 minute and 40 seconds, only a 7 second improvement.

Here is my code:

   #The program has some weird quirks 
#(it doesn't print 3 and 5 because of some errors I was getting) but other than that 
#it seems to work fine

#k is the number that we check up to
k <- 1000000

#n is the number of primes, initialized at 2
n <- 2

#We take only the odd numbers between 5 and k
y <- seq(5, k, 2)

#We take each member of i and check it for primality
for (i in y) {
  #i is assumed to be prime until proved otherwise
  prime <- TRUE

  #We check the remainder when i is divided by every odd number less than its square root
  for (j in c(2, seq(3, ceiling(sqrt(i)), 2))) {
    if (i %% j == 0) {

      #If it's found that some number divides i, we set prime to false and move on 
      #to the next i
      prime <- FALSE
      break
    }
  }
  #If no number has divided i, we haven't broken the loop so R will get to this point
  #We shouldn't need the if statement, but for some reason the n counter 
  #gets too big if I remove the statement
  if (prime) {
    print(i)
    n <- n + 1
  }
}

#Print out the number of primes
print(paste("There are", n, "prime numbers between", 1, "and", k))

In the version where I also remove multiples of 3, 5, and 7, I just define y as

y <- setdiff((setdiff((setdiff(seq(5, k, 2), seq(6, k, 3))), seq(10, k, 5))), seq(14, k, 7))

Sorry it's very messy, but it works.


Solution

  • Profile your code to find out where the bulk of the time is. As a general rule, don't use generics that have to determine their methods. Swap seq.int for seq. Especially in R, you can't just count the number of computations you think something will take and cut down on that. Actually look at how things are called.

    Here's how to profile your code:

    Call Rprof(tmp <- tempfile()), then execute your algorithm, then call Rprof();summaryRprof(tmp). Look at the calls with a high total.pct, and make your adjustments there.

    If you want a fast function for quickly determining all the primes up to some number in R, here's one for you. I don't think I wrote it, but my file doesn't tell me where I got it. It's an implementation of the sieve of Eratosthenes.

    sieve <- function(n){
      n <- as.integer(n)
      primes <- rep(TRUE, n)
      primes[1] <- FALSE
      last.prime <- 2L
      fsqr <- floor(sqrt(n))
      while (last.prime <= fsqr){
        primes[seq.int(2L*last.prime, n, last.prime)] <- FALSE
        sel <- which(primes[(last.prime+1):(fsqr+1)])
        if(any(sel)){
          last.prime <- last.prime + min(sel)
        }else last.prime <- fsqr+1
      }
      which(primes)
    }
    

    EDIT: I most certainly didn't write this, though my memory made me think I might have implemented it in R from some other language, this is not the case. A quick google search of the code shows that I got it directly from a post here several years back when I was just a lurker. The first sieve answer there was posted by @GeorgeDontas: https://stackoverflow.com/a/3790309/7936744

    EDIT 2: Now looking back through my files, I see my very minimal tweaks can eke a little bit more time out of this, and, probably unreliably, avoids a rare factor of 10 increase:

    prime_logic <- function(n){
      n <- as.integer(n)
      primes <- rep_len(TRUE, n)
      primes[1] <- FALSE
      last.prime <- 2L
      fsqr <- floor(sqrt(n))
      while (last.prime <= fsqr){
        primes[seq.int(2L*last.prime, n, last.prime)] <- FALSE
        sel <- which(primes[seq.int(last.prime+1, fsqr+1)])
        if(any(sel)){
          last.prime <- last.prime + min(sel)
        }else last.prime <- fsqr+1
      }
      primes
    }
    
    Unit: milliseconds
                   expr      min       lq     mean   median       uq       max neval cld
           sieve(1e+06) 29.80531 36.41955 50.24195 38.53144 41.86716 889.47422   100   a
     prime_logic(1e+06) 22.19734 33.05792 39.35390 35.52268 41.02838  71.68033   100   a