scala, apache-spark, distributed-computing

Is it possible to use spark to process complex entities with complex dependencies?


Consider a scenario (objects and dependencies are Scala classes):

There is a set of dependencies which themselves require a significant amount of data to be instantiated (data coming from a database). There is also a set of objects with a complex nested hierarchy which store references to those dependencies.

The current workflow consists of the steps below (sketched in Scala after the list):

  1. Loading the dependencies data from a database and instantiating them (in a pretty complex way with interdependencies).
  2. Loading object data from the database and instantiating objects using previously loaded dependencies.
  3. Running operations on a list of objects like:

    a. Search with a complex predicate
    b. Transform
    c. Filter
    d. Save to the database
    e. Reload from the database
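
To make the shape of the problem concrete, here is a minimal sketch; all class and method names are hypothetical and only illustrate the structure described above:

```scala
// Hypothetical sketch of the current single-machine workflow.
case class Dependency(id: Long, config: Map[String, String])
case class Entity(id: Long, payload: String, dependencyIds: Seq[Long])

object Workflow {
  // Step 1: load and wire up the dependencies from the database.
  def loadDependencies(): Map[Long, Dependency] = ???
  // Step 2: load the object data, resolving references to the dependencies.
  def loadEntities(deps: Map[Long, Dependency]): Seq[Entity] = ???

  // Step 3: operations over the list of objects.
  def complexPredicate(e: Entity, deps: Map[Long, Dependency]): Boolean = ???
  def transform(e: Entity, deps: Map[Long, Dependency]): Entity = ???
  def save(es: Seq[Entity]): Unit = ???

  def run(): Unit = {
    val deps     = loadDependencies()
    val entities = loadEntities(deps)

    val found       = entities.filter(e => complexPredicate(e, deps)) // 3a
    val transformed = found.map(e => transform(e, deps))              // 3b, 3c
    save(transformed)                                                 // 3d
    val reloaded    = loadEntities(deps)                              // 3e
  }
}
```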
    

We are considering running these operations on multiple machines. One of the options is to use Spark, but it is not clear how to properly support data serialization and distribute/update the dependencies.

Even if we are able to separate the logic in the objects from the data (making the objects easily serializable), the functions we want to run over the objects will still rely on the complex dependencies mentioned above.
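
For illustration, if the dependencies turn out to be serializable as a whole (which is not yet clear to us), the obvious Spark-side options seem to be broadcasting them once from the driver, or rebuilding them on each executor; a rough sketch, reusing the hypothetical names from the sketch above:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("entities").master("local[*]").getOrCreate()
val sc    = spark.sparkContext

// Option A: build the dependencies once on the driver and broadcast them
// (requires the whole dependency graph to be serializable).
val deps     = Workflow.loadDependencies()
val depsB    = sc.broadcast(deps)
val entities = sc.parallelize(Workflow.loadEntities(deps))

val transformed = entities
  .filter(e => Workflow.complexPredicate(e, depsB.value))
  .map(e => Workflow.transform(e, depsB.value))

// Option B: if the dependencies cannot be serialized, rebuild them once per
// partition on the executors instead of shipping them over the network.
val transformedB = entities.mapPartitions { it =>
  val localDeps = Workflow.loadDependencies() // runs on the executor
  it.map(e => Workflow.transform(e, localDeps))
}
```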

Additionally, at least at the moment, we don't have plans to use any operations that require shuffling data between machines; all we need is basically sharding.
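
For illustration, the operations listed above correspond to narrow transformations in Spark, i.e. each partition is processed independently on one executor and nothing is moved between machines; a rough sketch continuing from the previous snippet (the database calls are hypothetical):

```scala
// filter and map are narrow transformations: no shuffle is triggered.
val searched = transformed.filter(e => e.payload.nonEmpty)         // 3a, 3c
val updated  = searched.map(e => e.copy(payload = e.payload.trim)) // 3b

// Save with one database connection per partition (3d); the connection calls
// are placeholders for whatever database client is actually used.
updated.foreachPartition { (es: Iterator[Entity]) =>
  // val conn = Database.connect(dbConfig) // hypothetical
  es.foreach { e =>
    // conn.save(e)
  }
  // conn.close()
}
```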

Does Spark look like a good fit for such scenario?

  • If yes, how to handle the complex dependencies?
  • If no, would appreciate any pointers to alternative systems that can handle the workflow.

Solution

  • I don't fully understand what you mean by "complex interdependencies", but it seems that if you only need sharding you won't really get much from Spark - just run multiple copies of whatever you already have, and use a queue to synchronize the work and distribute to each copy the shard it needs to work on.

    We did something similar when converting a PySpark job to a Kubernetes setup: the queue holds the list of ids, and multiple pods (we control the scale via kubectl) read from that queue. We got much better performance and a simpler solution - see https://kubernetes.io/docs/tasks/job/coarse-parallel-processing-work-queue/
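
    The pattern boils down to each worker pulling one shard id at a time from a shared queue and running your existing single-machine logic on it; a rough sketch in Scala (the queue client is a stand-in for whatever broker you use, e.g. RabbitMQ or Redis):

    ```scala
    // Rough sketch of the coarse-grained work-queue pattern: every pod runs
    // this loop, popping shard ids until the queue is drained.
    object QueueWorker {
      // Stand-in for the real queue client.
      trait WorkQueue {
        def pop(): Option[String] // next shard id, or None when the queue is empty
      }

      def runWorker(queue: WorkQueue, processShard: String => Unit): Unit =
        Iterator
          .continually(queue.pop()) // keep asking for work
          .takeWhile(_.isDefined)   // stop once the queue is empty
          .foreach(id => processShard(id.get))
    }
    ```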