
Delta encoders: Using Java library in Scala


I have to compare, using Spark-based big data analysis, data sets (text files) that are very similar (>98%) but very large. After doing some research, I found that the most efficient way could be to use delta encoders: I can keep one reference text and store the others as delta increments. However, I use Scala, which does not have support for delta encoders, and I am not at all conversant with Java. But since Scala is interoperable with Java, I know it is possible to get a Java library working in Scala.

I found the most promising implementations to be xdelta, vcdiff-java and bsdiff. With a bit more searching, I found the most interesting library, dez. The linked page also gives benchmarks in which it seems to perform very well, and the code is free to use and looks lightweight.

At this point, I am stuck on using this library in Scala (via sbt). I would appreciate any suggestions or references for getting past this barrier, whether specific to this issue (delta encoders), to this library, or to working with a Java API in general from Scala. Specifically, my questions are:

  1. Is there a Scala library for delta encoders that I can directly use? (If not)

  2. Is it possible to place the class files / notzed.dez.jar in the project and have sbt make the API available to my Scala code?

I am kind of stuck in this quagmire and any way out would be greatly appreciated.


Solution

  • There are several details to take into account. There is no problem in using Java libraries directly from Scala, either as managed dependencies declared in sbt or as unmanaged dependencies (https://www.scala-sbt.org/1.x/docs/Library-Dependencies.html): "Dependencies in lib go on all the classpaths (for compile, test, run, and console)". You can then create a fat jar with your code and dependencies, for example with https://github.com/sbt/sbt-native-packager, and distribute it with spark-submit (see the first sketch below).

    The point here is to use these frameworks in Spark. To take advantage of Spark you would need to split your files into blocks so the algorithm can be distributed across the cluster for a single file. Or, if your files are compressed and each of them sits in one HDFS partition, you would need to adjust the size of the HDFS blocks, etc. (see the second sketch below).

    You can also use C modules: include them in your project and call them via JNI, the same way deep learning frameworks call native linear algebra routines (a JNI sketch is the third one below). So, in essence, there is a lot to discuss about how to implement these delta algorithms in Spark.
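
The build setup referenced above could be as small as the following sketch. The project name, Scala and Spark versions are illustrative assumptions; the dez jar is assumed to be dropped into the project's `lib/` directory as an unmanaged dependency, since it is distributed as a plain jar rather than from a public repository.

```scala
// build.sbt -- minimal sketch; names and versions here are assumptions
name := "delta-compare"
scalaVersion := "2.12.18"

// Managed dependency, resolved by sbt ("provided": the cluster supplies Spark at runtime).
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.4.1" % "provided"

// Unmanaged dependency: simply place notzed.dez.jar (or the compiled class files packed
// into a jar) in the project's lib/ directory. sbt puts everything in lib/ on the
// compile, test, run and console classpaths; no libraryDependencies entry is needed.
```

A fat jar containing your code plus the unmanaged jar can then be produced with a packaging plugin (the answer mentions sbt-native-packager; sbt-assembly is another common choice) and handed to spark-submit.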
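For the Spark side, here is a rough sketch of distributing the encoding work. The `deltaEncode` helper is hypothetical and stands in for whichever Java delta encoder is chosen (dez, xdelta, vcdiff-java, ...); the reference file is broadcast, and each similar file becomes one record that is encoded in parallel. Splitting one single very large file into blocks would need more work (a custom input format or record delimiter) than shown here.

```scala
import org.apache.spark.sql.SparkSession

object DeltaCompareSketch {

  // Hypothetical wrapper around the chosen Java delta encoder;
  // the real call depends on that library's API.
  def deltaEncode(reference: Array[Byte], target: Array[Byte]): Array[Byte] = ???

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("delta-compare").getOrCreate()
    val sc = spark.sparkContext

    // args(0): reference file, args(1): directory of similar files, args(2): output path
    val reference = sc.broadcast(
      java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(args(0))))

    // One (path, content) record per file; encoding runs in parallel across the cluster.
    sc.binaryFiles(args(1))
      .map { case (path, stream) => (path, deltaEncode(reference.value, stream.toArray())) }
      .saveAsObjectFile(args(2))

    spark.stop()
  }
}
```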
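And if a native (C) implementation such as xdelta were preferred, calling it through JNI from Scala looks roughly like this. The library name `xdeltajni` and the method signature are assumptions for illustration; the matching C implementation (bound against the header generated with `javac -h`) is not shown.

```scala
object NativeDelta {
  // Loads libxdeltajni.so / xdeltajni.dll from java.library.path the first time
  // the object is used; the name is a placeholder for whatever wraps the C encoder.
  System.loadLibrary("xdeltajni")

  // Scala 2.12+ accepts @native methods without a body; the implementation
  // lives in the native library.
  @native def encode(reference: Array[Byte], target: Array[Byte]): Array[Byte]
}

// Usage: val delta = NativeDelta.encode(referenceBytes, targetBytes)
```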