Tags: scala, apache-spark, apache-spark-1.6

Delete Unicode values in the output of Spark 1.6 using Scala


The file generated from the API contains data like the below:

col1,col2,col3
503004,(d$üíõ$F|'.h*Ë!øì=(.î;      ,.¡|®!®3-2-704

When I read it in Spark it appears like this. I am using a case class to read from the RDD and then converting it to a DataFrame using .toDF:

503004,������������,������������������������3-2-704

but I am trying to get a value like this, where only the alphanumeric characters are retained:

503004,dFh,3-2-704

I am using Spark 1.6 with Scala.

Please share your suggestions.
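Before stripping characters, it is worth checking the charset: the ������ runs in the output are Unicode replacement characters, which usually mean the file's bytes are being decoded with the wrong encoding rather than the data itself being broken. A minimal JVM-only sketch (no Spark needed) of how a charset mismatch mangles text:

```scala
import java.nio.charset.StandardCharsets

object CharsetDemo {
  def main(args: Array[String]): Unit = {
    // "ü" encoded as UTF-8 is the two bytes 0xC3 0xBC
    val bytes = "ü".getBytes(StandardCharsets.UTF_8)

    // Decoding those UTF-8 bytes with the wrong charset garbles the text
    println(new String(bytes, StandardCharsets.ISO_8859_1)) // prints "Ã¼"

    // Decoding with the correct charset recovers the original character
    println(new String(bytes, StandardCharsets.UTF_8))      // prints "ü"
  }
}
```

If the source file really is in another encoding (e.g. ISO-8859-1), decoding it as UTF-8 produces exactly this kind of replacement-character noise, so fixing the read may be an alternative to deleting the characters.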


Solution

  • This can be achieved with regexp_replace. Note that in Spark 1.6 there is no SparkSession, so the DataFrame is built through sc and sqlContext:
        // Spark 1.6: use the SparkContext (sc) and sqlContext implicits for .toDF
        import org.apache.spark.sql.functions.regexp_replace
        import sqlContext.implicits._

        val df = sc.parallelize(List(("503004", "d$üíõ$F|'.h*Ë!øì=(.î;      ,.¡|®!®", "3-2-704"))).toDF("col1", "col2", "col3")
        df.withColumn("col2_new", regexp_replace($"col2", "[^a-zA-Z]", "")).show()
    The pattern [^a-zA-Z] also strips digits; since col2 contains none, the result is the same here, but use [^a-zA-Z0-9] if you need to keep alphanumerics. Applying the replacement only to col2 leaves 3-2-704 in col3 untouched.
    Output:
    +------+--------------------+-------+--------+
    |  col1|                col2|   col3|col2_new|
    +------+--------------------+-------+--------+
    |503004|d$üíõ$F|'.h*Ë!øì=...|3-2-704|     dFh|
    +------+--------------------+-------+--------+
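The regex itself can be checked without a Spark cluster, since regexp_replace applies ordinary Java regex semantics per row. A plain-Scala sketch of the same cleaning logic (the clean helper is illustrative, not part of the Spark API):

```scala
object CleanDemo {
  // Same pattern as the answer, widened to keep digits as the question asks:
  // drop everything that is not an ASCII letter or digit.
  def clean(s: String): String = s.replaceAll("[^a-zA-Z0-9]", "")

  def main(args: Array[String]): Unit = {
    val raw = "d$üíõ$F|'.h*Ë!øì=(.î;      ,.¡|®!®"
    // Non-ASCII letters such as ü and Ë are outside [a-zA-Z], so they are dropped too
    println(clean(raw)) // prints "dFh"
  }
}
```

Note that the character class [a-zA-Z0-9] matches ASCII only, which is why accented letters like ü are removed; that matches the desired output in the question, but use the Unicode class \p{L} instead if accented letters should survive.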