Tags: scala, apache-spark, dataframe, scientific-notation

How to disable scientific notation in spark-xml


How can I disable scientific notation while reading an XML file with the Databricks spark-xml library? Here is the scenario: my XML file contains a numeric value with a trailing space, like this

<V1>42451267 </V1>

and I'm getting 4.2451267E7 instead of 42451267.
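
As far as I can tell this is just how Scala renders any Double of that size, so the question is really why the value is read as a double at all:

println(42451267L)   // prints 42451267
println(42451267.0)  // prints 4.2451267E7; Double.toString switches to scientific notation at 1e7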

How can I fix this?

My code and XML file are below.

val xmlLocation = "sampleFile/xml/sample.xml"
val rootTag = "RTS"
val rowTag = "COLUMNTYPE"
val sqlContext = MySparkDriver.getSqlContext().
  read.format("com.databricks.spark.xml")
if (rootTag != null && rootTag.nonEmpty)
  sqlContext.option("rootTag", rootTag)
sqlContext.option("rowTag", rowTag)
val xmlDF = sqlContext.load(xmlLocation)
xmlDF.show(false)

output

[WrappedArray(4232323.0, 4.2451267E7),21-11-2000 01:04:34,NTS,212212112,100.0,100.0]

expected

[WrappedArray(4232323, 42451267),21-11-2000 01:04:34,NTS,212212112,100.0000,100.0000]

XML file

<RTS>
<COLUMNTYPE>
    <D1>
        <V1>4232323</V1>
        <V1>42451267 </V1>
        <V2>21-11-2000 01:04:34</V2>
        <V3>NTS</V3>
        <V4>212212112</V4>
        <V7>100.0000</V7>
        <V8>100.0000 </V8>

    </D1>
</COLUMNTYPE>
</RTS>

Any help would be much appreciated.


Solution

  • The isLong function of spark-xml's TypeCast class cannot infer your value as a long because "42451267 " contains a trailing space, so inference falls back to double, and a double of that size prints in scientific notation.
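
    The whitespace asymmetry is easy to reproduce in plain Scala, which is presumably why inference lands on double: Long parsing rejects the trailing space while Double parsing ignores it.

    import scala.util.Try

    Try("42451267 ".toLong)   // Failure(java.lang.NumberFormatException: ...)
    Try("42451267 ".toDouble) // Success(4.2451267E7); Double.parseDouble ignores surrounding whitespace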

    However, if you want the values treated as longs, define your own custom schema in which the "V1" column has data type StringType, then trim and cast them yourself:

    val xmlLocation = "sampleFile/xml/sample.xml"
    val rootTag = "RTS"
    val rowTag = "COLUMNTYPE"
    val sqlContext = MySparkDriver.getSqlContext().
      read.format("com.databricks.spark.xml")
    if (rootTag != null && rootTag.nonEmpty)
      sqlContext.option("rootTag", rootTag)
    sqlContext.option("rowTag", rowTag)
    

    Custom schema. "V1" is an ArrayType because the <V1> element repeats within D1, and its elements stay StringType so the trailing space survives until we trim it:

    import org.apache.spark.sql.types._

    val customSchema = StructType(Array(
      StructField("D1", StructType(
        Seq(StructField("V1", ArrayType(StringType, true), true),
          StructField("V2", StringType, true),
          StructField("V3", StringType, true),
          StructField("V4", LongType, true),
          StructField("V7", DoubleType, true),
          StructField("V8", DoubleType, true))), true)))
    sqlContext.schema(customSchema)
    

    Create a UDF that trims the values:

    import org.apache.spark.sql.functions._
    val toTrim = udf((xs: Seq[String]) => xs.map(_.trim()))
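
    If some rows were to omit <V1> entirely (not the case in your sample XML, but worth guarding against), the array column would be null and the UDF above would throw a NullPointerException. A null-safe sketch:

    val toTrimSafe = udf((xs: Seq[String]) => Option(xs).map(_.map(_.trim)).orNull)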
    

    Apply the UDF and cast the result to long:

    val xmlDF = sqlContext.load(xmlLocation).select(struct(
      toTrim(col("D1.V1")).cast("array<long>").alias("V1"),
      col("D1.V2"), col("D1.V3"), col("D1.V4"), col("D1.V7"), col("D1.V8"))
      .alias("D1"))
    xmlDF.printSchema
    xmlDF.show(false)
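
    Two caveats. First, the expected 100.0000 cannot come out of a DoubleType column: a double does not store trailing zeros, so V7 and V8 will print as 100.0 regardless; if that exact rendering matters, keep them as StringType in the custom schema or format them at display time (e.g. f"$v%.4f"). Second, on Spark 2.4 or later the trim UDF could be replaced by the built-in transform higher-order function; a sketch under that version assumption, mirroring the select above:

    import org.apache.spark.sql.functions._

    // Spark 2.4+: trim and cast each array element in SQL, no UDF needed
    val xmlDF2 = sqlContext.load(xmlLocation).select(struct(
      expr("transform(D1.V1, x -> cast(trim(x) as long))").alias("V1"),
      col("D1.V2"), col("D1.V3"), col("D1.V4"), col("D1.V7"), col("D1.V8"))
      .alias("D1"))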