All, I will need to distribute some computing ( for now it is only academic ), and I was planning on using Spark to do so.
I am now running some tests, and they go like this:
I have a large file of numeric values; I sum each line and then output the results. I've made a non-Spark version as below:
import csv

def linesum(inputline):
    # sum all values of one ( already parsed ) line
    m = 0
    for i in inputline:
        m = m + i
    return m

with open('numbers.txt', 'r') as f:
    reader = csv.reader(f, delimiter=';')
    testdata = [list(map(float, rec)) for rec in reader]

testdata_out = list()
print('input : ' + str(testdata))
for i in testdata:
    testdata_out.append(linesum(i))
testdata = testdata_out[:]
print('output : ' + str(testdata_out))
print(len(testdata))
print('OK')
and ran it on a 600k-line text file; then
I made a local Spark installation and ran the following code:
import os
import csv
from pyspark import SparkConf, SparkContext

if 'SPARK_HOME' not in os.environ:
    os.environ['SPARK_HOME'] = r'C:\spark\spark-2.0.1-bin-hadoop2.7'

conf = SparkConf().setAppName('file_read_sum').setMaster('local[4]')
sc = SparkContext(conf=conf)

def linesum(inputline):
    # sum all values of one ( already parsed ) line
    m = 0
    for i in inputline:
        m = m + i
    return m

with open('numbers.txt', 'r') as f:
    reader = csv.reader(f, delimiter=';')
    testdata = [list(map(float, rec)) for rec in reader]

print('input : ' + str(testdata))
print(len(testdata))

# numSlices must be an integer, so use floor division and at least 1 slice
testdata_rdd = sc.parallelize(testdata, numSlices=max(1, len(testdata) // 10000))
testdata_out = testdata_rdd.map(linesum).collect()
testdata = testdata_out[:]
print('output : ' + str(testdata_out))
print(len(testdata_out))
print('OK')
The results match, but the first ( without Spark ) is much faster than the second. I've also made a distributed Spark installation across 4 VMs and, as expected, the result is even worse.
I do understand that there is some overhead, especially when using the VMs. My questions are:
1) - Is my reasoning sound? Is Spark an appropriate tool to distribute this kind of job? ( for now I am only summing the lines, but the lines can be VERY large and the operations may be much, much more complex ( think Genetic Programming fitness evaluation here ) )
2) - Is my code appropriate for distributing calculations?
3) - How can I improve the speed of this? ( one direction I am considering is sketched below )
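For 3), one direction I am considering ( not benchmarked yet, so this is only a sketch, assuming numbers.txt keeps the same ';'-separated layout ) is to let Spark read and split the file itself with textFile, instead of parsing everything in the driver and shipping the whole list out via parallelize:

# workers read and parse the file; the full list never has to live in the driver
testdata_rdd = sc.textFile('numbers.txt', minPartitions=4)
testdata_out = ( testdata_rdd
                 .map(lambda line: sum(float(x) for x in line.split(';')))
                 .collect() )
print(len(testdata_out))

Whether this can beat the plain-Python loop for such a cheap per-line operation is exactly part of what I am asking.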
The core idea behind the power that the GP-approaches have opened is the zero-indoctrination of the process. It is evolutionary; it is the process' self-developed diversity of candidates ( each population member is a candidate solution, having a ( very ) different fitness ( "bestness-of-fit" ) ). So most of the processing power goes right into the principle used for increasing the potential to evolve the maximum width of the evolutionary search, where genetic operations mediate self-actualisation ( via cross-over, mutation and change of architecture ) together with self-reproduction. Spark is a fit for the very opposite -- for a rigid, scripted workflow, having zero space for any evolutionary behaviour.
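( Just to anchor the terminology -- a toy, purely illustrative sketch of a single evolutionary generation step; evaluate_fitness, crossover and mutate are made-up placeholders for whatever the actual GP implementation uses: )

import random

def next_generation(population, evaluate_fitness, crossover, mutate,
                    elite_fraction=0.1, mutation_rate=0.05):
    # score every candidate -- this is the step that is typically CPU-bound
    scored = sorted(population, key=evaluate_fitness, reverse=True)
    n_elite = max(1, int(len(scored) * elite_fraction))
    new_population = scored[:n_elite]                  # self-reproduction of the fittest
    while len(new_population) < len(population):
        mum, dad = random.sample(scored[:len(scored) // 2], 2)
        child = crossover(mum, dad)                    # self-actualisation via cross-over
        if random.random() < mutation_rate:
            child = mutate(child)                      # ...and via mutation
        new_population.append(child)
    return new_population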
The richer the diversity of population members the evolutionary generator is able to scan, the better. So let the diversity grow wider and forget about tools for rigid & repetitive RDD calculus ( where an RDD is "the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel." Notice the word immutable. ).
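( To make the immutability point tangible -- every RDD transformation returns a new RDD and never modifies the original, so an "evolving population" would have to be rebuilt as a fresh RDD on every single generation; a minimal, standalone sketch: )

from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName('rdd_immutability').setMaster('local[2]'))

population = sc.parallelize([1.0, 2.0, 3.0])
mutated    = population.map(lambda x: x * 1.1)   # a brand NEW RDD is created

print(population.collect())   # still [1.0, 2.0, 3.0] -- the original was never touched
print(mutated.collect())      # the "mutated" values live in a separate RDD

sc.stop()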
Nota Bene: using VMs for testing a ( potential ) speedup of parallel ( well, in practice not [PARALLEL] but "just"-( might-be-highly )-[CONCURRENT] scheduling ) processing performance is an exceptionally bad idea. Why? It consumes more overheads on shared resources ( in the case of a mere container-based deployment ), plus additional overheads in the hypervisor-service planes; next, it absolutely devastates all temporal cache-localities inside the VM-abstracted vCPU/vCore(s)' L1/L2/L3-caches, all of that criss-cross-chopped by the external O/S fighting for its few CPU-CLK-ticks on the external process-scheduler. So the idea is indeed a bad, very bad anti-pattern, one that may get some super-dogmatic advertisement from cloud-protagonists' support ( hard-core, technically not-adjusted PR cliché + heavy bell$ & whistle$ ), but that shows negative performance gains if rigorously validated against raw-silicon execution.
Given we are here about GP, the distributed execution may best help in generating an increased width of the evolution-accelerated diversity, not in naive code-execution.
Very beneficial in GP is the global self-robustness of the evolutionary concept -- many uncoordinated ( async ) and very independent processing nodes are typically much more powerful ( in terms of the global TFLOPs levels achieved ), and practical reports even show that failed nodes, even in large numbers ( if not small tens of percent (!!) ), still do not devastate the finally achieved quality of the last epochs of the global search across the self-evolving populations. That is a point! You will indeed love GP if you can harness these few core principles into a light-weight async herd of distributed-computing nodes, correctly & just-enough for the HPC-ruled GP/GA-searches!
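( A purely illustrative sketch of what "uncoordinated & independent" can look like, even on a single box -- each worker process evolves its own island-population with zero messaging between islands, and the main process just collects each island's best at the end; random_candidate and fitness are made-up placeholders for whatever the actual GP uses: )

import random
from multiprocessing import Pool

def random_candidate():
    # placeholder candidate: a small vector of coefficients
    return [random.uniform(-1.0, 1.0) for _ in range(8)]

def fitness(candidate):
    # placeholder fitness: the closer the sum is to zero, the "better"
    return -abs(sum(candidate))

def evolve_island(seed, generations=200, pop_size=50):
    # each island is fully independent: own RNG, own population, no coordination
    random.seed(seed)
    population = [random_candidate() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[:pop_size // 2]
        children = []
        for _ in range(pop_size - len(parents)):
            mum, dad = random.sample(parents, 2)
            cut = random.randrange(1, len(mum))
            child = mum[:cut] + dad[cut:]                                    # cross-over
            if random.random() < 0.1:
                child[random.randrange(len(child))] += random.gauss(0, 0.1)  # mutation
            children.append(child)
        population = parents + children
    best = max(population, key=fitness)
    return fitness(best), best

if __name__ == '__main__':
    with Pool(4) as pool:                    # 4 async, uncoordinated islands
        results = pool.map(evolve_island, range(4))
    print(max(results))                      # best ( fitness, candidate ) across islands

Losing an island here only costs a fraction of the diversity, not the result -- which is exactly the self-robustness mentioned above.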
To get some pieces of very-first-hand experience, read John R. KOZA's remarks on his GP distributed-processing concepts, where +99% of the problem's processing is actually spent ( and where a CPU-bound processing deserves a maximum possible acceleration -- surprisingly not by re-distribution, right because of not wanting to lose a single item's locality ). I am almost sure that if you are serious about GP/GA, you will both like it & benefit from his pioneering work.