Tags: security, apache-spark, build, shared-libraries, external

Spark security considerations


I have some extra security considerations beyond a normal job. I usually build with sbt and point it at a Maven repository for the libraries I need. But now I'm unable to use most external libraries, and at this point I'm not sure I'll even be able to go out to Maven to fetch the Spark libraries I might need. Even if I could get the external libraries, each one would go through a vetting process that takes months. Has anyone been in a similar situation? For those who couldn't use external libraries, can anyone share what they did to build a successful suite of Spark jobs for data munging and data science on a Hadoop cluster?


Solution

  • I don't think there is a standard solution for your problem within the context you describe. It depends on how far you go with external dependencies and what you really need. Take an example: parsing CSV rows and constructing DataFrames/Datasets or RDDs. You have several options (see the sketch after this list):

    • use an external library (from Databricks or others)
    • roll your own parser, so there is no external dependency
    • rely on newer Spark versions, which know how to deal with CSV natively
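
    As a minimal sketch of the last two options, assuming Spark 2.x and a hypothetical input path and schema (name, age), the hand-rolled route and the built-in reader might look like this:

```scala
import org.apache.spark.sql.SparkSession

object CsvWithoutExternalDeps {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("csv-example").getOrCreate()
    import spark.implicits._

    // Option: hand-rolled parsing, no external dependency at all.
    // Note: split(",") is naive and breaks on quoted fields containing commas.
    val byHand = spark.sparkContext
      .textFile("/data/input.csv") // hypothetical path
      .map(_.split(",", -1))
      .map(cols => (cols(0), cols(1).trim.toInt))
      .toDF("name", "age")

    // Option: Spark 2.x ships its own CSV reader, so no
    // spark-csv artifact from Databricks is needed anymore.
    val builtIn = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/input.csv")

    byHand.show()
    builtIn.printSchema()
    spark.stop()
  }
}
```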

    If you have a Hadoop cluster, then the Spark runtime environment already contains plenty of libraries that will be loaded (JSON manipulation, networking, logging, just to name a few). Most of the business logic inside your Spark jobs can be done with those.

    Let me give some examples of how I have approached the problem of external dependencies, although I didn't have any security constraints. In one case we had to use Spring within our Spark application (because we wanted to update some relational tables), so we built a fat jar with all the Spring dependencies, and they were many. Conclusion: we pulled in a lot of dependencies for nothing (a horror to maintain :) ), so that was not a good approach. In another case we had to do the same thing, but we kept the dependencies to a minimum: the simplest thing that can read/update a table over plain JDBC. Conclusion: the fat jar was not that big; we kept only what was really needed, nothing more, nothing less.
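
    To illustrate that "keep it minimal" approach, here is a rough sketch of updating a table over plain JDBC from inside a Spark job. The table name, connection URL, and credentials are made up; `java.sql` ships with the JDK, so the only external jar is the vendor's JDBC driver (which must be on the executor classpath):

```scala
import java.sql.DriverManager

import org.apache.spark.sql.SparkSession

object MinimalJdbcUpdate {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jdbc-update").getOrCreate()
    import spark.implicits._

    // Hypothetical per-job results to push back into a relational table.
    val results = Seq(("job-1", 42L), ("job-2", 7L)).toDF("job", "count")

    // Hypothetical connection settings -- substitute your own.
    val url = "jdbc:postgresql://db-host:5432/reporting"

    // One connection per partition; java.sql is JDK-provided,
    // so the fat jar stays small.
    results.rdd.foreachPartition { rows =>
      val conn = DriverManager.getConnection(url, "user", "password")
      val stmt = conn.prepareStatement(
        "UPDATE job_stats SET row_count = ? WHERE job_name = ?")
      try {
        rows.foreach { row =>
          stmt.setLong(1, row.getLong(1))
          stmt.setString(2, row.getString(0))
          stmt.executeUpdate()
        }
      } finally {
        stmt.close()
        conn.close()
      }
    }
    spark.stop()
  }
}
```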

    Spark already provides you with a lot of functionality. Knowing an external library that can do something does not mean that Spark can't do it with what it has.
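
    For example, the JSON manipulation mentioned above is already built in (assuming Spark 2.1+ for `from_json`), so no external JSON library needs to go through vetting:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object BuiltInJson {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("json-example").getOrCreate()
    import spark.implicits._

    // Parse JSON strings with Spark's own functions -- nothing
    // external needs to be added to the fat jar.
    val schema = StructType(Seq(StructField("user", StringType)))
    val df = Seq("""{"user":"alice"}""", """{"user":"bob"}""")
      .toDF("raw")
      .select(from_json(col("raw"), schema).alias("parsed"))

    df.select("parsed.user").show()
    spark.stop()
  }
}
```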