apache-spark apache-spark-sql spark-java

Spark reading json number as string

I'm working on a Spark Java application using Spark 2.4.7. I load a JSON file into a DataFrame like this:

Dataset<Row> df = sparkSession().read().option("multiline", true).format("json").load(path_of_json);

The problem is that one attribute in my JSON file has a numeric value, but when I call printSchema() on the DataFrame, that attribute is shown as StringType rather than LongType.

JSON file:

[
{"first": {
      "id" :"fdfd",
      "name":"temp",
      "type":-1       --> reading it as LongType
       },
 "something":"something_else",
 "data" : {
      "key": {
          "field":7569,   --> reading it as StringType
          "temp":"dfdfd"
       }
    }
}]

I tried to reproduce the issue in my local Spark shell, but it works fine there. Does anyone have an idea why this is happening?

Solution

By default, Spark tries to infer the schema automatically when reading from a JSON data source. However, if you already know the schema, you can specify it explicitly when loading the DataFrame.

You first need to define the schema as an instance of the StructType class, specifying each field's name and data type. You can do it manually:

StructType keyType = new StructType()
    .add(new StructField("field", DataTypes.LongType, true, Metadata.empty()))
    .add(new StructField("temp", DataTypes.StringType, true, Metadata.empty()));

StructType dataType = new StructType()
    .add(new StructField("key", keyType, true, Metadata.empty()));

StructType firstType = new StructType()
    .add(new StructField("id", DataTypes.StringType, true, Metadata.empty()))
    .add(new StructField("name", DataTypes.StringType, true, Metadata.empty()))
    .add(new StructField("type", DataTypes.LongType, true, Metadata.empty()));

StructType schema = new StructType()
    .add(new StructField("data", dataType, true, Metadata.empty()))
    .add(new StructField("first", firstType, true, Metadata.empty()))
    .add(new StructField("something", DataTypes.StringType, true, Metadata.empty()));

or from a DDL string:

StructType schema = StructType.fromDDL("data STRUCT<key: STRUCT<field: BIGINT, temp: STRING>>,first STRUCT<id: STRING, name: STRING, type: BIGINT>,something STRING");

Then specify the schema when loading the DataFrame:

Dataset<Row> df = spark.read()
        .option("multiline", true)
        .format("json")
        .schema(schema)
        .load(jsonPath);
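
To confirm that the explicit schema took effect, one option is to inspect the type of the nested field after loading. The following is only a minimal sketch under stated assumptions: the class name SchemaCheck, the helper buildSchema(), the local[*] master, and taking the file path from args[0] are all hypothetical choices made for illustration and are not part of the original question.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical helper class, used here only to illustrate the check.
public class SchemaCheck {

    // Builds the schema from a DDL string (same fields as in the answer above).
    static StructType buildSchema() {
        return StructType.fromDDL(
              "data STRUCT<key: STRUCT<field: BIGINT, temp: STRING>>, "
            + "first STRUCT<id: STRING, name: STRING, type: BIGINT>, "
            + "something STRING");
    }

    public static void main(String[] args) {
        // Assumption: a local session is enough for a quick check.
        SparkSession spark = SparkSession.builder()
                .appName("schema-check")
                .master("local[*]")
                .getOrCreate();

        // Assumption: the JSON file path is passed as the first argument.
        String jsonPath = args[0];

        Dataset<Row> df = spark.read()
                .option("multiline", true)
                .format("json")
                .schema(buildSchema())
                .load(jsonPath);

        // Print the full schema for a visual check.
        df.printSchema();

        // Walk down to data.key.field and print its data type.
        StructType data = (StructType) df.schema().apply("data").dataType();
        StructType key = (StructType) data.apply("key").dataType();
        System.out.println(key.apply("field").dataType());

        spark.stop();
    }
}
```

With the schema supplied explicitly, Spark skips inference, so data.key.field should be reported with the type declared in the schema rather than whatever inference would have chosen.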