
Installation of graphframes package in an offline Spark cluster


I have an offline pyspark cluster (no internet access) on which I need to install the graphframes library.

I have manually downloaded the jar from here, added it to $SPARK_HOME/jars/, and when I try to use it I get the following error:

error: missing or invalid dependency detected while loading class file 'Logging.class'.
Could not access term typesafe in package com,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
A full rebuild may help if 'Logging.class' was compiled against an incompatible version of com.
error: missing or invalid dependency detected while loading class file 'Logging.class'.
Could not access term scalalogging in value com.typesafe,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
A full rebuild may help if 'Logging.class' was compiled against an incompatible version of com.typesafe.
error: missing or invalid dependency detected while loading class file 'Logging.class'.
Could not access type LazyLogging in value com.slf4j,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
A full rebuild may help if 'Logging.class' was compiled against an incompatible version of com.slf4j.

What is the correct way to install it offline together with all of its dependencies?


Solution

  • I managed to install the graphframes library. First of all, I found the graphframes dependencies, which were:

    scala-logging-api_xx-xx.jar
    scala-logging-slf4j_xx-xx.jar
    

    where xx stands for the appropriate Scala version and jar version. Then I installed them in the proper path. Because I work on a Cloudera machine, the proper path is:

    /opt/cloudera/parcels/SPARK2/lib/spark2/jars/

    If you cannot place them in this directory on your cluster (because you have no root rights and your admin is super lazy), you can simply add them to your spark-submit / spark-shell invocation:

    spark-submit ..... --driver-class-path /path-for-jar/  \
                       --jars /../graphframes-0.5.0-spark2.1-s_2.11.jar,/../scala-logging-slf4j_2.10-2.1.2.jar,/../scala-logging-api_2.10-2.1.2.jar
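
    If you start the Spark session from a plain Python process (for example a script or a notebook) instead of through spark-submit, the jars can also be passed through Spark configuration properties. The following is only a minimal sketch under the assumption that the jars were copied to /opt/jars/; adjust the paths to wherever you actually placed them:

    from pyspark.sql import SparkSession

    # Hypothetical local paths; point these at the jars you copied onto the cluster.
    jars = ",".join([
        "/opt/jars/graphframes-0.5.0-spark2.1-s_2.11.jar",
        "/opt/jars/scala-logging-slf4j_2.10-2.1.2.jar",
        "/opt/jars/scala-logging-api_2.10-2.1.2.jar",
    ])

    spark = (
        SparkSession.builder
        .appName("graphframes-offline")
        # spark.jars puts the listed jars on both the driver and executor classpaths.
        .config("spark.jars", jars)
        .getOrCreate()
    )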
    

    Getting the jars onto the classpath like this is enough for Scala. In order to use graphframes from Python, you also need to download the graphframes jar and then, from a shell:

    #Extract JAR content
     jar xf graphframes_graphframes-0.3.0-spark2.0-s_2.11.jar
    #Enter the folder
     cd graphframes
    #Zip the contents
     zip graphframes.zip -r *
    

    Then add the zipped file to your Python path in spark-env.sh or your .bash_profile with

    export PYTHONPATH=$PYTHONPATH:/..proper path/graphframes.zip:.
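
    If you would rather not edit spark-env.sh or your profile, the zip can also be shipped at runtime from the Python side. This is a minimal sketch with a placeholder path (SparkContext.addPyFile distributes the archive and puts it on the Python path of the driver and the executors):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Placeholder path: point it at the graphframes.zip created above.
    spark.sparkContext.addPyFile("/path/to/graphframes.zip")

    from graphframes import GraphFrame  # import only after the zip has been added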
    

    Then, when opening the shell or submitting a job (again with the same arguments as with Scala), importing graphframes works normally.
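
    As a quick sanity check that both the jar and the Python package are picked up, a small hypothetical example along these lines should now run without the Logging.class errors shown above (the data is purely illustrative; "id", "src" and "dst" are the column names GraphFrames expects for vertices and edges):

    from pyspark.sql import SparkSession
    from graphframes import GraphFrame

    spark = SparkSession.builder.appName("graphframes-check").getOrCreate()

    # Minimal vertex and edge DataFrames.
    vertices = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
    edges = spark.createDataFrame([("a", "b", "follows")], ["src", "dst", "relationship"])

    g = GraphFrame(vertices, edges)
    g.inDegrees.show()  # simple query to confirm the graph was built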

    This link was extremely useful for this solution.