Search code examples
scalaapache-sparkspark-graphxsparklyr

Reading graph from file


Looking to run a GraphX example on my Windows machine using Spark-Shell from SparklyR install of Hadoop/Spark. Am able to launch the shell from the install directory here first:

start C:\\Users\\eyeOfTheStorm\\AppData\\Local\\rstudio\\spark\\Cache\\spark-2.0.0-bin-hadoop2.7\\bin\\spark-shell 

Output:

17/01/02 12:21:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/01/02 12:21:07 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://192.168.99.1:4040
Spark context available as 'sc' (master = local[*], app id = local-1483388466798).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) Client VM, Java 1.8.0_111)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

Then using this text example from SPARK IN ACTION as Cit-Hepth.txt saved in C:\Users\eyeOfTheStorm with this data for example use:

"V1"    "V2"
1001    9304045
1001    9308122
1001    9309097
1001    9311042
1001    9401139
1001    9404151
1001    9407087
1001    9408099
1001    9501030
1001    9503124
1001    9504090

Then I simply run val graph = GraphLoader.edgeListFile(sc, "Cit-HepTh.txt") from the Scala shell, and get the below errors. Note, the HADOOP_HOME is automatically set by SparklyR with the correct winutils installed in C:\Users\eyeOfTheStorm\AppData\Local\rstudio\spark\Cache\spark-2.0.0-bin-hadoop2.7\tmp\hadoop. Is there a missing piece of code or a path which would eliminate the errors below and run the code?

scala> val graph = GraphLoader.edgeListFile(sc, "Cit-HepTh.txt")
17/01/02 12:41:48 WARN BlockManager: Putting block rdd_5_0 failed
17/01/02 12:41:48 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NumberFormatException: For input string: ""V1""
        at java.lang.NumberFormatException.forInputString(Unknown Source)
        at java.lang.Long.parseLong(Unknown Source)
        at java.lang.Long.parseLong(Unknown Source)
        at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276)
        at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
        at org.apache.spark.graphx.GraphLoader$$anonfun$1$$anonfun$apply$1.apply(GraphLoader.scala:83)
        at org.apache.spark.graphx.GraphLoader$$anonfun$1$$anonfun$apply$1.apply(GraphLoader.scala:77)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at org.apache.spark.graphx.GraphLoader$$anonfun$1.apply(GraphLoader.scala:77)
        at org.apache.spark.graphx.GraphLoader$$anonfun$1.apply(GraphLoader.scala:75)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:801)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:801)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
        at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:85)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
17/01/02 12:41:48 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NumberFormatException: For input string: ""V1""
        at java.lang.NumberFormatException.forInputString(Unknown Source)
        at java.lang.Long.parseLong(Unknown Source)
        at java.lang.Long.parseLong(Unknown Source)
        at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276)
        at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
        at org.apache.spark.graphx.GraphLoader$$anonfun$1$$anonfun$apply$1.apply(GraphLoader.scala:83)
        at org.apache.spark.graphx.GraphLoader$$anonfun$1$$anonfun$apply$1.apply(GraphLoader.scala:77)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at org.apache.spark.graphx.GraphLoader$$anonfun$1.apply(GraphLoader.scala:77)
        at org.apache.spark.graphx.GraphLoader$$anonfun$1.apply(GraphLoader.scala:75)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:801)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:801)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
        at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:85)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)

17/01/02 12:41:48 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
17/01/02 12:41:48 WARN BlockManager: Putting block rdd_5_1 failed
17/01/02 12:41:48 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, localhost): TaskKilled (killed intentionally)
[Stage 0:>                                                          (0 + 1) / 2]org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NumberFormatException: For input string: ""V1""
        at java.lang.NumberFormatException.forInputString(Unknown Source)
        at java.lang.Long.parseLong(Unknown Source)
        at java.lang.Long.parseLong(Unknown Source)
        at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276)
        at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
        at org.apache.spark.graphx.GraphLoader$$anonfun$1$$anonfun$apply$1.apply(GraphLoader.scala:83)
        at org.apache.spark.graphx.GraphLoader$$anonfun$1$$anonfun$apply$1.apply(GraphLoader.scala:77)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at org.apache.spark.graphx.GraphLoader$$anonfun$1.apply(GraphLoader.scala:77)
        at org.apache.spark.graphx.GraphLoader$$anonfun$1.apply(GraphLoader.scala:75)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:801)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:801)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
        at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:85)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911)
  at org.apache.spark.rdd.RDD.count(RDD.scala:1115)
  at org.apache.spark.graphx.GraphLoader$.edgeListFile(GraphLoader.scala:94)
  ... 50 elided
Caused by: java.lang.NumberFormatException: For input string: ""V1""
  at java.lang.NumberFormatException.forInputString(Unknown Source)
  at java.lang.Long.parseLong(Unknown Source)
  at java.lang.Long.parseLong(Unknown Source)
  at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276)
  at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
  at org.apache.spark.graphx.GraphLoader$$anonfun$1$$anonfun$apply$1.apply(GraphLoader.scala:83)
  at org.apache.spark.graphx.GraphLoader$$anonfun$1$$anonfun$apply$1.apply(GraphLoader.scala:77)
  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
  at org.apache.spark.graphx.GraphLoader$$anonfun$1.apply(GraphLoader.scala:77)
  at org.apache.spark.graphx.GraphLoader$$anonfun$1.apply(GraphLoader.scala:75)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:801)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:801)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
  at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
  at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
  at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
  at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
  at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
  at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
  at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
  at org.apache.spark.scheduler.Task.run(Task.scala:85)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
  at java.lang.Thread.run(Unknown Source)

Solution

  • Input for GraphLoader.edgeListFile should be:

    edge list formatted file where each line contains two integers: a source id and a target id.

    No other values like headers or attributes are allowed. You can either strip header line manually or use alternative loading method for example csv reader:

    val spark: SparkSession = ???    
    import spark.implicits._
    
    val path: String = ???
    
    Graph.fromEdgeTuples(
      spark.read
        // Adjust separator if needed
        .options(Map("header" -> "true", "delimiter" -> "\t"))
        .csv(path)
        .select($"V1".cast("long"), $"V2".cast("long"))
        .as[(Long, Long)]
        .rdd,
      defaultValue = 0
    )
    

    You could also use GraphFrames:

    import org.graphframes.GraphFrame
    
    GraphFrame.fromEdges(spark.read
      .options(Map("header" -> "true", "delimiter" -> "\t"))
      .csv(path)
      .toDF("src", "dst")
    ).toGraphX