Optimizing bigint calls

I am currently using the book 'Programming in D' for learning D. I tried to solve a problem of summing up the squares of numbers from 1 to 10000000. I first made a functional approach to solve the problem with the map and reduce but as the numbers get bigger I have to cast the numbers to bigint to get the correct output.

  long num = 10000001;
  BigInt result;
  result = iota(1,num).map!(a => to!BigInt(a * a)).reduce!((a,b) => (a + b));
  writeln("The sum is : ", result);

The above takes 7s to finish when compiled with dmd -O . I profiled the program and most of the time is wasted on BigInt calls. Though the square of the number can fit into a long I have to typecast them to bigint so that reduce function sums and returns the appropriate sum. The python program takes only 3 seconds to finish it. When num = 100000000 D program gets to 1 minute and 13 seconds to finish. Is there a way to optimize the calls to bigint. The products can themselves be long but they have to be typecasted as bigint objects so that they give right results from reduce operations. I tried pushing the square of the numbers into a bigint array but its also slower. I tried to typecast all the numbers as Bigint

auto bigs_map_nums = iota(1,num).map!(a => to!BigInt(a)).array;
auto bigs_map = sum(bigs_map_nums.map!(a => (a * a)).array);

But its also slower. I read the answers at How to optimize this short factorial function in scala? (Creating 50000 BigInts). Is it a problem with the implementation of multiplication for bigger integers in D too ? Is there a way to optimize the function calls to BigInt ?

python code :

timeit.timeit('print sum(map(lambda num : num * num, range(1,10000000)))',number=1)
333333283333335000000
3.58552622795105

The code was executed on a dual-core 64 bit linux laptop with 2 GB RAM. python : 2.7.4 dmd : DMD64 D Compiler v2.066.1

Solution

Without range coolness: foreach(x; 0 .. num) result += x * x;

With range cool(?)ness:

import std.functional: reverseArgs;
result = iota(1, num)
    .map!(a => a * a)
    .reverseArgs!(reduce!((a, b) => a + b))(BigInt(0) /* seed */);

The key is to avoid BigInting every element, of course.

The range version is a little slower than the non-range one. Both are significantly faster than the python version.

Edit: Oh! Oh! It can be made much more pleasant with std.algorithm.sum:

result = iota(1, num)
    .map!(a => a * a)
    .sum(BigInt(0));