Tags: scala, apache-spark, collections, treemap

Load a file into a Map while keeping the original line order


I need to load a lookup CSV file that will be used to apply some regex rules (key, value) to strings. These rules need to be applied in the order they appear in the file.

Loading it into a Map doesn't guarantee the order is kept.

Is there a way to load the CSV file into a structure like a TreeMap (or other) while maintaining the file row order?

I would like to avoid hardcoding an index/key directly into the file (that would be a possible solution, but it would make the CSV dictionaries harder to maintain). Perhaps there's a way to generate an index "on the fly" while loading it?
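To illustrate what an "on the fly" index could look like, here is a minimal sketch in plain Scala. The `rows` value is a hypothetical stand-in for the parsed CSV lines, not the real file; the idea is only that `zipWithIndex` attaches each element's position, so the index never has to be stored in the CSV itself.

```scala
// Hypothetical sample rows standing in for the parsed CSV lines (not the real file).
val rows: Seq[(String, String)] = Seq(
  ("(ab)cd", "$1"),
  ("(ab)cde", "$1"),
  ("(ab)", "$1")
)

// zipWithIndex pairs each element with its position in the sequence.
val indexed: Seq[((String, String), Int)] = rows.zipWithIndex

// Even if the collection is shuffled afterwards, sorting by the index restores the original order.
val restored: Seq[(String, String)] =
  scala.util.Random.shuffle(indexed).sortBy(_._2).map(_._1)
```

Spark's RDD API also has a `zipWithIndex` method (returning `Long` indices), so the same idea might carry over to the Spark side, although I have not verified how it interacts with the CSV reader.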

val vMap = sparkSession.read.option( "header", true ).csv( pPath )
      .rdd.map(x => (x.getString(0), x.getString(1)))  
      .collectAsMap()

So, given some "rules" like:

(ab)cd, $1

(ab)cde, $1

(ab),$1

(ab)cdfgh,$1

(ff)gt,$1

In the end, I would like a collection that I can iterate over in that same order, preferably with a foreach method. What I get now is a Map with no defined order, which may be iterated in a different order each time.
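To make clear why the order matters, here is a small sketch of how such rules might be applied. It assumes each rule is a (regex, replacement) pair applied with `String.replaceAll`, and that every rule operates on the output of the previous one; `applyRules` is a hypothetical helper name, not existing code.

```scala
import scala.collection.mutable

// Hypothetical helper: applies each (regex, replacement) rule in iteration order,
// feeding the output of one rule into the next.
def applyRules(rules: Iterable[(String, String)], input: String): String =
  rules.foldLeft(input) { case (acc, (pattern, replacement)) =>
    acc.replaceAll(pattern, replacement)
  }

// Rules in file order; LinkedHashMap iterates in insertion order.
val inFileOrder = mutable.LinkedHashMap[String, String](
  "(ab)cd"  -> "$1",
  "(ab)cde" -> "$1",
  "(ab)"    -> "$1"
)

// The first two rules swapped, to show that the result depends on the order.
val swapped = Seq("(ab)cde" -> "$1", "(ab)cd" -> "$1", "(ab)" -> "$1")

val resultInOrder = applyRules(inFileOrder, "abcde") // "abe": "(ab)cd" matches first
val resultSwapped = applyRules(swapped, "abcde")     // "ab": "(ab)cde" matches first
```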

Edit: I forgot to mention that I am using Scala 2.11.12, which came bundled with the latest Spark release.

Possible solution (based on user6337's answer)

After reading the answer and tinkering with it, I arrived at this piece of code.

import scala.collection.mutable

val myMap = mutable.LinkedHashMap[String, String]()
// collect() brings the rows to the driver; insert them in the order they arrive
sparkSession.read.option( "header", true ).csv( pPath )
      .collect().foreach( t => myMap += ((t(0).toString, t(1).toString)) )

myMap.foreach( x => println(x._1 + " - " + x._2) )

My new concern is whether reading the file through a DataFrame is enough to guarantee that the file's line order is preserved.
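One way to sidestep that question, assuming the lookup file is small and can be read directly on the driver, would be to skip the DataFrame and parse the lines sequentially. The sketch below is only an illustration: `loadRules` is a hypothetical helper, and it assumes a header row and fields without embedded commas or quotes. The sample string stands in for `scala.io.Source.fromFile(path).getLines()`, which yields lines in file order.

```scala
import scala.collection.mutable

// Hypothetical helper: parses "pattern,replacement" lines into an insertion-ordered map.
// Assumes the first line is a header and that no field contains a comma or quotes.
def loadRules(lines: Iterator[String]): mutable.LinkedHashMap[String, String] = {
  val rules = mutable.LinkedHashMap[String, String]()
  lines.drop(1)                       // skip the header row
    .map(_.trim).filter(_.nonEmpty)   // ignore blank lines
    .foreach { line =>
      // Split at the first comma only; throws a MatchError if the line has no comma.
      val Array(pattern, replacement) = line.split(",", 2).map(_.trim)
      rules += (pattern -> replacement)
    }
  rules
}

// Stand-in for the lines of the real file (made-up header names).
val sample = "pattern,replacement\n(ab)cd, $1\n(ab)cde, $1\n(ab),$1".split("\n").iterator
val loaded = loadRules(sample)
```

Iterating `loaded` with foreach then visits the rules in the order they appeared in the input.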


Solution

  • Use a LinkedHashMap, which preserves the order in which items were inserted.

    Here is some example code:

    import scala.collection.mutable
    
    object Main extends App {
    
      val myList = List(("1", "a"),("2","b"),("3","c"),("4","d"))
      println(myList)
    
      val myMap = mutable.LinkedHashMap[String, String]()
    
      // ++= works on Scala 2.11 through 2.13; addAll only exists from Scala 2.13 onwards
      myMap ++= myList
    
      myMap.foreach(println)
    }
    

    Running this code prints:

    List((1,a), (2,b), (3,c), (4,d))
    (1,a)
    (2,b)
    (3,c)
    (4,d)
    

    which is what you want.

    So first convert your data into an ordered collection such as a List or a Vector, then load it into a mutable LinkedHashMap with addAll (Scala 2.13+) or ++= (which also works on the Scala 2.11.12 mentioned in the question). Iterating the LinkedHashMap with map or foreach then follows insertion order.