scala apache-spark apache-spark-sql rdd spark-csv

Prevent delimiter collision while reading csv in Spark


I'm trying to create an RDD using a CSV dataset.

The problem is that one column, location, holds values such as (11112,222222), which contain a comma. I don't use this column.

So when I use the map function with split(","), that value is split into two columns.

Here is my code:

     val header = collisionsRDD.first 

     case class Collision (date:String,time:String,borough:String,zip:String,
      onStreet:String,crossStreet:String,                                  
      offStreet:String,numPersInjured:Int,
      numPersKilled:Int,numPedesInjured:Int,numPedesKilled:Int,
      numCyclInjured:Int,numCycleKilled:Int,numMotoInjured:Int)   


     val collisionsPlat = collisionsRDD.filter(h => h != header).
                map(x => x.split(",").map(x => x.replace("\"","")))

     val collisionsCase = collisionsPlat.map(x => Collision(x(0),
                                x(1), x(2), x(3),                  
                                x(8), x(9), x(10),
                                x(11).toInt,x(12).toInt,
                                x(13).toInt,x(14).toInt,
                                x(15).toInt,x(16).toInt,
                                x(17).toInt))
     collisionsCase.take(5)                                                  

How can I keep the comma inside this field from being treated as a CSV delimiter?


Solution

  • Use spark-csv to read the file. Its quote option is enabled by default, so delimiters inside quoted fields are not treated as separators.

    For Spark 1.6:

    sqlContext.read.format("com.databricks.spark.csv").load(file)
    

    Or, for Spark 2:

    spark.read.csv(file)
    

    From the docs:

    quote: by default the quote character is ", but can be set to any character. Delimiters inside quotes are ignored.

    $ cat abc.csv
    a,b,c
    1,"2,3,4",5
    5,"7,8,9",10
    
    scala> case class ABC (a: String, b: String, c: String)
    
    scala> spark.read.option("header", "true").csv("abc.csv").as[ABC].show
    +---+-----+---+
    |  a|    b|  c|
    +---+-----+---+
    |  1|2,3,4|  5|
    |  5|7,8,9| 10|
    +---+-----+---+
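
Since the question asks for an RDD, the same reader can be followed by .rdd and a positional mapping into a case class. The following is only a minimal sketch under stated assumptions: the Record class, the QuotedCsvToRdd object, the three-column layout and the sample rows are hypothetical stand-ins for the real Collision data, and it assumes Spark 2 or later with every field present (an empty field is read as null and would need extra handling).

```scala
import java.nio.file.Files

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical three-field record used for illustration only;
// the real Collision class has many more fields.
case class Record(id: Int, location: String, injured: Int)

object QuotedCsvToRdd {

  // Reads a CSV file whose quoted fields may contain commas and
  // returns an RDD of Record, using positional column access.
  def load(spark: SparkSession, path: String): RDD[Record] =
    spark.read
      .option("header", "true") // the reader skips the header row for us
      .option("quote", "\"")    // the default value, spelled out for clarity
      .csv(path)                // without a schema, every column is read as a string
      .rdd
      .map(row => Record(row.getString(0).toInt, row.getString(1), row.getString(2).toInt))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("quoted-csv")
      .getOrCreate()

    // Made-up sample data: the location field is quoted and contains a comma.
    val file = Files.createTempFile("collisions", ".csv")
    val sample =
      "id,location,injured\n" +
      "1,\"(11112,222222)\",0\n" +
      "2,\"(33333,444444)\",3\n"
    Files.write(file, sample.getBytes("UTF-8"))

    // Each location value should arrive as a single field, comma included.
    load(spark, file.toUri.toString).collect().foreach(println)

    spark.stop()
  }
}
```

With this approach there is no manual split(",") or quote stripping, so the header filter and the replace("\"","") step from the question are no longer needed.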