Tags: scala · maven · unit-testing · apache-spark · intellij-idea

Why do my Scala+Spark app unit tests run so much faster in IntelliJ than in a regular mvn clean test run?


I have a whole set of UTs that perform quite a lot of Spark operations. I've noticed that when I run the test set in IntelliJ IDEA, it finishes in about 10 minutes. When I build with Maven instead, the process takes almost an hour. If I run just the Maven test goal, it takes over 50 minutes, so most of the time is spent executing the UTs.

I compared the execution logs between the IntelliJ and Maven runs and they are identical (apart from the order of the parallel operations, obviously), so the execution is functionally equivalent. I'm not sure how to find what's causing this huge performance drop when the UTs run under Maven.

An example of the time differences, using the timestamps reported in the logs (grouping and discarding identical lines/times), in one of the tests:

Maven: 102 seconds

12:34:50 [ScalaTest...
12:35:01 [ScalaTest...
12:35:19 [Executor...
12:35:20 [ScalaTest...
12:35:25 [ScalaTest...
12:36:06 [Executor...
12:36:08 [ScalaTest...
12:36:16 [ScalaTest...
12:36:24 [ScalaTest...
12:36:32 [ScalaTest...

IntelliJ: 26 seconds

12:49:53 [ScalaTest...
12:49:58 [ScalaTest...
12:50:04 [Executor...
12:50:04 [ScalaTest...
12:50:07 [ScalaTest...
12:50:13 [Executor...
12:50:14 [ScalaTest...
12:50:16 [ScalaTest...
12:50:18 [ScalaTest...
12:50:19 [ScalaTest...

I see the same pattern in every other test where Spark operations are performed. Sometimes the difference between environments is almost 10x, and averaged across all tests it's around 5x. A lot of the waiting seems to happen when switching to parallel execution on the nodes. Any idea how to identify the configuration settings that may cause this? Is there any Spark setting I can apply to get both environments running with similar processing times?

I have already tried reducing partitions and setting spark.sql.shuffle.partitions to low values (1, 2, 3...), but I don't see any difference.
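For reference, lowering shuffle partitions "in code" is typically done on the session builder before any action runs. A minimal sketch of a local-mode test session, assuming a typical setup (the master, app name, and values are illustrative, not the asker's actual configuration):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative local-mode session for unit tests; values are examples only.
val spark = SparkSession.builder()
  .master("local[2]")                           // a couple of local threads
  .appName("unit-tests")
  .config("spark.sql.shuffle.partitions", "2")  // fewer shuffle tasks
  .config("spark.ui.enabled", "false")          // skip the web UI in tests
  .getOrCreate()
```

Note that settings applied this way only take effect for sessions created after the builder runs; a session already created elsewhere (e.g. in a shared test base class) keeps its original configuration.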

EDIT: I started playing around with Surefire, ScalaTest, and Maven memory settings:

        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <configuration>
                    <argLine>-Xmx8G -XX:MaxPermSize=4048M</argLine>
                    <forkCount>1</forkCount>
                    <reuseForks>true</reuseForks>
                    ...

I thought this helped somewhat, but when I measured the times, it made no difference either.

Thanks!


Solution

  • Ok, so I managed to make the tests run even quicker than in IntelliJ by changing these settings in the pom.xml. The Surefire test runner was not being used for the Scala UTs; the scalatest-maven-plugin runs them, so that is where the configuration matters.

    <plugin>
        <groupId>org.scalatest</groupId>
        <artifactId>scalatest-maven-plugin</artifactId>
        <configuration>
            <reportsDirectory>${project.build.directory}/surefire-reports</reportsDirectory>
            <junitxml>.</junitxml>
            <filereports>WDF TestSuite.txt</filereports>
            <argLine>-Xss1G -Xms4G -Xmx8G -XX:ReservedCodeCacheSize=2G</argLine>
            <systemProperties>
                <spark.testing>1</spark.testing>
                <spark.sql.shuffle.partitions>4</spark.sql.shuffle.partitions>
            </systemProperties>
        </configuration>
        <executions>
            <execution>
                <id>test</id>
                <goals>
                    <goal>test</goal>
                </goals>
            </execution>
        </executions>
    </plugin>
    

    Curiously, setting 4 shuffle partitions in code didn't provide any time improvement, but setting it here did help a lot. That, combined with the increased memory for the ScalaTest plugin, did the trick.
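A likely reason the pom-level setting works where the in-code one didn't: SparkConf, with defaults enabled, copies every JVM system property whose key starts with `spark.` into the conf, and the plugin's `<systemProperties>` are set before any test code runs, so every session the tests create picks them up. A plain-Scala sketch of that prefix-filtering behaviour (no Spark dependency; the filter mimics what SparkConf does, and the object name is ours, not a Spark API):

```scala
object SparkPropsSketch {
  // Mimics SparkConf's default behaviour: collect all JVM system
  // properties whose key starts with "spark.".
  def sparkProps(): Map[String, String] =
    sys.props.toMap.filter { case (k, _) => k.startsWith("spark.") }

  def main(args: Array[String]): Unit = {
    // The scalatest-maven-plugin's <systemProperties> entries land here
    // before any test runs; we set them manually for the demo.
    sys.props("spark.testing") = "1"
    sys.props("spark.sql.shuffle.partitions") = "4"

    val picked = sparkProps()
    println(picked("spark.sql.shuffle.partitions")) // prints 4
  }
}
```

In-code settings, by contrast, only affect sessions built after the setting is applied, which is easy to get wrong when a shared session is created once in a test base class.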