Tags: apache-spark, spark-streaming, snappy

How to configure Executor in Spark Local Mode


In Short

I want to configure my application to use lz4 compression instead of snappy. What I did is:

SparkSession session = SparkSession.builder()
        .master(SPARK_MASTER) // local[1]
        .appName(SPARK_APP_NAME)
        .config("spark.io.compression.codec", "org.apache.spark.io.LZ4CompressionCodec")
        .getOrCreate();

but looking at the console output, the executor is still using snappy:

org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY

and

[Executor task launch worker-0] compress.CodecPool (CodecPool.java:getCompressor(153)) - Got brand-new compressor [.snappy]

According to this post, what I did here only configures the driver, not the executor. The solution in that post is to change the spark-defaults.conf file, but I'm running Spark in local mode and don't have that file anywhere.
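(As an aside, the same properties can also be set programmatically through a SparkConf passed to the builder, which sidesteps spark-defaults.conf entirely. A minimal sketch, with the app name as a hypothetical placeholder, and using the short codec name "lz4", which Spark accepts as an alias for the full class name:)

import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

// Build the configuration up front and hand it to the session builder.
SparkConf conf = new SparkConf()
        .setMaster("local[1]")
        .setAppName("codec-test") // hypothetical app name
        .set("spark.io.compression.codec", "lz4"); // short form of the codec class name

SparkSession session = SparkSession.builder()
        .config(conf)
        .getOrCreate();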

Some more detail:

I need to run the application in local mode (for the purpose of unit tests). The tests work fine locally on my machine, but when I submit them to a build machine (RHEL5_64), I get the error

snappy-1.0.5-libsnappyjava.so: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.9' not found

I did some research, and it seems the bundled snappy native library needs a newer libstdc++ than RHEL5 ships (hence the missing GLIBCXX_3.4.9), so the simplest fix is to use lz4 instead of snappy as the codec. That is why I tried the configuration above.

I have been stuck on this issue for several hours; any help is appreciated, thank you.


Solution

  • Posting my solution here: @user8371915's answer does address the question, but it did not solve my problem, because in my case I can't modify the property files.

    What I ended up doing was adding another configuration:

    SparkSession session = SparkSession.builder()
            .master(SPARK_MASTER) // local[1]
            .appName(SPARK_APP_NAME)
            .config("spark.io.compression.codec", "org.apache.spark.io.LZ4CompressionCodec")
            .config("spark.sql.parquet.compression.codec", "uncompressed")
            .getOrCreate();
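
    For context, spark.io.compression.codec controls Spark's internal compression (shuffle files, broadcast variables, RDD blocks), while spark.sql.parquet.compression.codec controls the codec Parquet output is written with, which is what the CodecConfig log line above reports. If uncompressed Parquet is too large, a non-native codec such as gzip avoids libsnappyjava as well; it can also be set per write. A sketch, with df and outputPath as hypothetical placeholders:

    // Hypothetical: write a DataFrame as gzip-compressed Parquet,
    // steering clear of the native snappy library entirely.
    df.write()
            .option("compression", "gzip")
            .parquet(outputPath);

    (Equivalently, spark.sql.parquet.compression.codec could be set to "gzip" instead of "uncompressed" in the builder above.)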