apache-spark, spark-graphx

How to Convert a Collection of a String Array to String/Text in Spark Using Scala


Code from Apache Spark GraphX gives me this result:

Array[(org.apache.spark.graphx.VertexId, Array[org.apache.spark.graphx.VertexId])] = Array((4,Array(17, 18, 20)), (16,Array(20)), (14,Array()), (6,Array(7)), (8,Array(9, 10)), (12,Array(1)), (20,Array(16, 19)), (18,Array()), (10,Array()), (2,Array(4, 15, 16)), (19,Array(4)), (13,Array()), (15,Array()), (11,Array(1)), (1,Array(5, 8)), (17,Array(4)), (3,Array(1, 8, 13, 14)), (7,Array(5)), (9,Array(5, 8)), (5,Array(1, 6, 7, 8)))

After saveAsTextFile, the output file contains:

(16,[J@4ee106a0)
(20,[J@6d1dcef6)
(13,[J@4c3850da)
(3,[J@7e97b33a)
(8,[J@7c0ad5d1)
(2,[J@321e8c0d)
(1,[J@7964eb06)
(5,[J@172243cb)
(14,[J@519adbc6)
(18,[J@1154e795)
(15,[J@16175a92)
(7,[J@5fc8c46b)
(4,[J@6996f848)
(12,[J@34e6faa9)
(19,[J@6aec10c5)
(17,[J@69a45e4d)
(6,[J@6a94d262)
(10,[J@3c4a02cd)
(11,[J@7081d0e4)
(9,[J@78269e87)

How can I convert these arrays so that they are saved in a readable form, such as:

(4: (17, 18, 20)) 

or something similar?


Solution

  • Convert each array to a String with the mkString() method:

    scala> val records = Array((4,Array(17, 18, 20)), (16,Array(20)), (14,Array()))
    records: Array[(Int, Array[_ <: Int])] = Array((4,Array(17, 18, 20)), (16,Array(20)), (14,Array()))
    
    scala> val recordsRDD = sc.parallelize(records)
    recordsRDD: org.apache.spark.rdd.RDD[(Int, Array[_ <: Int])] = ParallelCollectionRDD[0] at parallelize at <console>:14
    
    scala> recordsRDD.map(rec => "(" + rec._1 + ": (" + rec._2.mkString(",") + "))").collect().foreach(println)
    (4: (17,18,20))
    (16: (20))
    (14: ())
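
As background on the `[J@4ee106a0` entries in the question: `saveAsTextFile` writes each record using its `toString`, and a JVM array does not override `toString`, so a tuple holding an `Array[Long]` prints the array as the type code `[J` followed by a hash. The following is a minimal sketch in plain Scala (no Spark needed) that contrasts the two; the object and value names are made up for illustration.

```scala
object ArrayToStringDemo {
  // Hypothetical sample shaped like one neighbor list from the question.
  val neighbors: Array[Long] = Array(17L, 18L, 20L)

  // Default JVM rendering: "[J" (array of long) plus an identity hash.
  val raw: String = neighbors.toString

  // mkString renders the elements themselves.
  val readable: String = neighbors.mkString("(", ", ", ")")

  def main(args: Array[String]): Unit = {
    println(raw)      // starts with [J@; the hash differs from run to run
    println(readable) // (17, 18, 20)
  }
}
```

This is why the fix has to happen in a `map` before saving: the array must be turned into a String explicitly.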
    

    The mkString method is overloaded, so you can also add a prefix and suffix:

      scala> val a = Array("apple", "banana", "cherry")
      scala> a.mkString("[", ", ", "]")
      res4: String = [apple, banana, cherry]
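
For reference, here is a small sketch of the three `mkString` overloads applied to Long arrays like the ones in the question, including the empty-array case that several vertices produce. It is plain Scala and the names are illustrative only.

```scala
object MkStringOverloads {
  // Hypothetical neighbor lists: one populated, one empty.
  val ids: Array[Long] = Array(17L, 18L, 20L)
  val none: Array[Long] = Array.empty[Long]

  // No arguments: elements are concatenated with nothing between them.
  val plain: String = ids.mkString

  // One argument: the separator placed between elements.
  val separated: String = ids.mkString(", ")

  // Three arguments: prefix, separator, suffix.
  val wrapped: String = ids.mkString("(", ", ", ")")

  // On an empty array only the prefix and suffix remain.
  val emptyWrapped: String = none.mkString("(", ", ", ")")

  def main(args: Array[String]): Unit = {
    println(plain)        // 171820
    println(separated)    // 17, 18, 20
    println(wrapped)      // (17, 18, 20)
    println(emptyWrapped) // ()
  }
}
```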
    

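    To save the result, apply the same map before calling saveAsTextFile. The first variant builds the string by concatenation; the second uses the prefix/separator/suffix overload:

    scala> recordsRDD.map(rec => "(" + rec._1 + ": (" + rec._2.mkString(",") + "))").saveAsTextFile("/user/cloudera/col_toString1")
    scala> recordsRDD.map(rec => "(" + rec._1 + rec._2.mkString(": (", ", ", ")") + ")").saveAsTextFile("/user/cloudera/col_toString2")

    Reading the files back:
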
    [cloudera@quickstart ~]$ hadoop fs -cat /user/cloudera/col_toString1/p*
    (4: (17,18,20))
    (16: (20))
    (14: ())
    [cloudera@quickstart ~]$ hadoop fs -cat /user/cloudera/col_toString2/p*
    (4: (17, 18, 20))
    (16: (20))
    (14: ())
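
The examples above use `Int` values, while the GraphX result in the question holds `VertexId` values, which GraphX defines as an alias for `Long`. A possible way to package the formatting as a reusable function is sketched below. It is plain Scala so it runs without Spark; `AdjacencyFormat`, `formatRecord`, and the sample data are hypothetical names chosen for this sketch, not part of GraphX.

```scala
object AdjacencyFormat {
  // GraphX declares VertexId as an alias for Long; it is redeclared here
  // only so the sketch compiles without Spark on the classpath.
  type VertexId = Long

  // Renders one (vertex, neighbors) pair as "(id: (n1, n2, ...))".
  def formatRecord(rec: (VertexId, Array[VertexId])): String = {
    val (id, neighbors) = rec
    val joined = neighbors.mkString("(", ", ", ")")
    s"($id: $joined)"
  }

  def main(args: Array[String]): Unit = {
    // Sample records copied from the question's output.
    val sample: Array[(VertexId, Array[VertexId])] = Array(
      (4L, Array(17L, 18L, 20L)),
      (16L, Array(20L)),
      (14L, Array.empty[VertexId])
    )
    sample.map(formatRecord).foreach(println)
    // With Spark available, the same function could be mapped over the RDD
    // before saving, for example:
    //   neighborsRDD.map(AdjacencyFormat.formatRecord).saveAsTextFile(outputPath)
    // neighborsRDD and outputPath are placeholders, not names from the post.
  }
}
```

Keeping the formatting in a standalone function makes it easy to check on a few local records before running it on the full RDD.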