numpy parallel-processing vectorization theano

I want to map a function to each element of a vector in Theano, can I do it without using scan?

Say a function that counts the appearances of ones at each index of an array:

import theano
import theano.tensor as T


A = T.vector("A")
idx_range = T.arange(A.shape[0])

result, updates = theano.scan(fn=lambda idx: T.sum(A[:idx+1]), sequences=idx_range)

count_ones = theano.function(inputs=[A], outputs=result)

print count_ones([0,0,1,0,0,1,1,1])
# gives [ 0.  0.  1.  1.  1.  2.  3.  4.]

As said here, using scan may not be efficient. Plus, theano.scan always produces "RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility from scan_perform.scan_perform import *" on my machine.

So I was wondering is there a better way of mapping functions in Theano?
Thanks in advance.

Edit:
I just realized it is a terrible example, apparently there's a more efficient way of just looping over the vector once like:

result, updates = theano.scan(fn=lambda prior_result, a: prior_result + a,
                              outputs_info=T.alloc(np.int32(0), 1),
                              sequences=A,
                              n_steps=A.shape[0])

However according to @Daniel Renshaw's answer, since

the computation in one step is dependent on the same computation at some earlier step

so actually I can't avoid using scan in this regard, right?

Edit:
I thought of a way of vercotrizing it as:

A = T.vector()
in_size = 8
# a matrix with ones at and below the given diagonal and zeros elsewhere
mask = theano.shared(numpy.tri(in_size))  
result = T.dot(mask, A)
count_ones = theano.function(inputs=[A], outputs=result)
print count_ones(numpy.asarray([0,0,1,0,0,1,1,1]))

But in this case I have to know the size of the input in advance (unless I can craft numpy.tri like matrices on the fly?).
Any suggestions would be welcome. :)

Edit:
I benchmarked the three methods using a 512D input array and 10000 iterations, and got the following results:

map a sum function to each element: CPU 16s GPU 140s
loop over the array using scan: CPU 13s GPU 32s
vectorization: CPU 0.8s GPU 0.8s (actually I don't think theano has engaged GPU to do this

Solution

In the most general case, if no assumptions are made about the function, then scan would have to be used. However, many (maybe most?) useful functions can be vectorized such that scan is not needed. As is pointed out in the question edit, the example function can certainly be applied to the input without using scan.

To decided whether scan is needed will depend on the function that needs to be applied. Case that certainly will require scan are those when the computation in one step is dependent on the same computation at some earlier step.

P.S. The warning about binary incompatibility can be safely ignored.