
How can I transform top level dependencies from mvn dependency:tree into a list of Maven coordinates using bash?


To build a spark-submit command for my applications without creating uber-jars, I want my build process to produce a comma-separated list of the Maven coordinates of the application's top-level dependencies. I can then pass that list to spark-submit with --packages= (or spark.jars.packages=).

The dependencies can be retrieved with `mvn dependency:tree`, which prints a tree in this format:

[INFO] com.myorg:my-project:jar:1.0-SNAPSHOT
[INFO] +- org.scala-lang:scala-library:jar:2.11.12:compile
[INFO] +- org.scala-lang:scala-compiler:jar:2.11.12:compile
[INFO] |  \- org.scala-lang.modules:scala-parser-combinators_2.11:jar:1.0.4:compile
[INFO] +- com.fasterxml.jackson.core:jackson-annotations:jar:2.9.10:compile
[INFO] +- io.circe:circe-config_2.11:jar:0.6.1:compile
[INFO] |  +- com.typesafe:config:jar:1.3.3:compile
[INFO] |  +- io.circe:circe-core_2.11:jar:0.11.1:compile
[INFO] |  |  +- io.circe:circe-numbers_2.11:jar:0.11.1:compile
[INFO] |  |  \- org.typelevel:cats-core_2.11:jar:1.5.0:compile
[INFO] |  |     +- org.typelevel:cats-kernel_2.11:jar:1.5.0:compile
[INFO] |  |     \- org.typelevel:machinist_2.11:jar:0.6.6:compile
[INFO] +- org.scalatest:scalatest_2.11:jar:3.0.8:test
[INFO] |  \- org.scalactic:scalactic_2.11:jar:3.0.8:test
[INFO] \- org.mock-server:mockserver-netty:jar:5.6.1:test
[INFO]    +- org.mock-server:mockserver-client-java:jar:5.6.1:test
[INFO]    +- org.mock-server:mockserver-core:jar:5.6.1:test
[INFO]    |  +- io.netty:netty-codec-socks:jar:4.1.35.Final:test
[INFO]    |  +- com.github.java-json-tools:json-schema-validator:jar:2.2.10:test
[INFO]    |  |  +- javax.mail:mailapi:jar:1.4.3:test
[INFO]    |  |  +- com.googlecode.libphonenumber:libphonenumber:jar:8.0.0:test
[INFO]    |  |  \- net.sf.jopt-simple:jopt-simple:jar:5.0.3:test
[INFO]    |  +- com.jayway.jsonpath:json-path:jar:2.4.0:test
[INFO]    |  |  \- net.minidev:json-smart:jar:2.3:test
[INFO]    |  |     \- net.minidev:accessors-smart:jar:1.2:test
[INFO]    |  |        \- org.ow2.asm:asm:jar:5.0.4:test
[INFO]    |  +- org.apache.commons:commons-text:jar:1.3:test
[INFO]    |  \- org.apache.commons:commons-collections4:jar:4.2:test
[INFO]    +- io.netty:netty-buffer:jar:4.1.35.Final:test
[INFO]    +- io.netty:netty-handler:jar:4.1.35.Final:test
[INFO]    \- io.netty:netty-transport:jar:4.1.35.Final:test
[INFO]       \- io.netty:netty-resolver:jar:4.1.35.Final:test

Note that the top level dependencies are preceded by "[INFO] +- " (with a single space after the '-').

Only the ":jar:" dependencies are relevant and out of those only the ":compile" dependencies.

I want to output only the lines that meet all of the following conditions:

  • starting with "[INFO] +- "
  • containing ":jar:"
  • containing ":compile"

and from these extract the organization:package:version, like so: org.scala-lang:scala-library:jar:2.11.12:compile ==> org.scala-lang:scala-library:2.11.12

and then join the results into a single line, delimited by commas (,).


Solution

  • The following solution worked for me:

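    mvn dependency:tree | grep -e '^\[INFO\] +- ' | grep -e ':jar:' | grep -e ':compile$' | cut -d ' ' -f3 | sed 's/:jar:/:/' | sed 's/:compile$//' | paste -sd ',' -

    Only the square brackets need escaping here: in grep's default basic regular expressions, the '+' and the spaces match literally. Backslashes in front of ordinary letters are unnecessary, and recent GNU grep versions warn about them.

    The grep commands filter the lines, cut selects the third space-separated field (the coordinate itself), the sed commands strip the ":jar:" type and the trailing ":compile" scope, and paste joins the remaining lines with commas. The trailing '-' tells paste to read standard input, which some non-GNU implementations require.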
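The same filtering can also be done in a single awk pass, which may be easier to reuse from a build script. The following is only a sketch under a few assumptions: the function name top_level_coords is made up for illustration, the input is expected to have exactly the "[INFO] +- group:artifact:jar:version:compile" layout shown in the question, and coordinates with a classifier (six colon-separated parts) are skipped. Unlike the original requirement, it also accepts a top-level line that starts with "\- ", since Maven draws the last direct dependency that way.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name chosen for illustration only).
# Reads `mvn dependency:tree` output on stdin and prints the top-level,
# compile-scope jar dependencies as one comma-separated line of
# group:artifact:version coordinates.
top_level_coords() {
  awk '
    # Direct dependencies start right after "[INFO] " with "+- " or "\- ";
    # nested ones have "|" or extra spaces there and do not match.
    /^\[INFO\] [+\\]- / {
      n = split($3, parts, ":")
      # Assumes group:artifact:type:version:scope (no classifier).
      if (n == 5 && parts[3] == "jar" && parts[5] == "compile") {
        printf "%s%s:%s:%s", sep, parts[1], parts[2], parts[4]
        sep = ","
      }
    }
    # Terminate the line only if something was printed.
    END { if (sep != "") print "" }
  '
}

# Usage (not executed here):
#   mvn dependency:tree | top_level_coords
```

The result could then be captured in a variable, for example packages=$(mvn dependency:tree | top_level_coords), and passed to spark-submit as --packages "$packages".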