
Spark ElasticSearch query missing fields in dataframe


A customer recently upgraded their Elasticsearch cluster to version 7.13.3, and I noticed issues while retrieving documents from a particular index. I query the index on a particular field, and the resulting dataframe is missing some fields. This happens in a Java Spark ETL job, but I was able to reproduce the issue in my own environment.

Here's a sample of the document:

{
  "_index" : "myIndex",
  "_type" : "_doc",
  "_id" : "id",
  "_score" : 1.0,
  "_source" : {
    "field1" : true,
    "field2" : "namefile.xml",
    "fieldToQuery" : "valueToQuery",
    "document" : {
      "nestedField1" : {
        "field3" : "filename.xml"
      },
      "nestedField2" : {
        "anotherNestedField" : {...},
        ...
      },
      "nestedField3" : {
        "theMissingNestedField" : {
          "fieldX" : "xxxxx",
          "fieldY" : "yyyyy"
        }
      }
    }
  }
}

I want to query documents that have "valueToQuery" on the field "fieldToQuery".

val query = """{"query": {"bool": {"must": [{"term": {"fieldToQuery": "valueToQuery"}}] }}} """

val df = spark.read.format("org.elasticsearch.spark.sql").option("es.nodes", "myEsServer:9200").option("es.nodes.wan.only",true).option("query", query).load("myIndex")

The query is correct and returns results; the problem is that some fields are missing. If I print the schema I get the following:

scala> df.printSchema()
root
 |-- field1: boolean (nullable = true)
 |-- field2: string (nullable = true)
 |-- document: struct (nullable = true)
 |    |-- nestedField1: struct (nullable = true)
 |    |    |-- field3: string (nullable = true)
 |    |-- nestedField2: struct (nullable = true)
 |    |    |-- anotherNestedField1: struct (nullable = true)
 |    |    |    |-- field: string (nullable = true)

As you can see, Spark seems unable to infer the complete schema of the document and misses the "theMissingNestedField" struct.

I didn't force any particular schema, since the requirement is that the nested fields can change at any time.
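For completeness, if forcing a schema were acceptable, a minimal sketch of an explicit schema might look like the following. The field names are taken from the sample document above, and whether the elasticsearch-spark connector actually honors a user-supplied schema depends on the connector version, so treat this purely as an illustration of the shape of the missing struct:

```scala
import org.apache.spark.sql.types._

// Illustrative only: an explicit StructType covering the struct that
// schema inference missed. Field names come from the sample document;
// real documents would need their full set of fields listed here.
val explicitSchema = StructType(Seq(
  StructField("field1", BooleanType),
  StructField("field2", StringType),
  StructField("fieldToQuery", StringType),
  StructField("document", StructType(Seq(
    StructField("nestedField3", StructType(Seq(
      StructField("theMissingNestedField", StructType(Seq(
        StructField("fieldX", StringType),
        StructField("fieldY", StringType)
      )))
    )))
  )))
))

// A reader would then pass it explicitly instead of relying on inference:
// spark.read.format("org.elasticsearch.spark.sql").schema(explicitSchema)...
```

In my case this wasn't an option, because the nested fields are not stable.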

Since the documents are very long and contain sensitive data, I used an example to illustrate the issue I'm currently facing. If needed, I can try to provide a masked example so that you can reproduce it on your own.

Any help would be appreciated, thanks.


Solution

  • Finally found the solution: the schema on the new ES server was different from the one on the previous server.

    To my understanding, the Spark driver builds the dataframe schema from the ES index mapping, so it was impossible to get a dataframe containing fields that were not mapped in the index.

    The solution was to map the index correctly (as in the old instance), and the ETL job now works as expected.
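    As a hedged illustration, the fix amounts to making sure the index mapping on the new cluster explicitly declares the nested objects. A sketch of the relevant mapping fragment (field names from the sample document; the `keyword` types are an assumption) could look like:

    ```json
    {
      "mappings": {
        "properties": {
          "document": {
            "properties": {
              "nestedField3": {
                "properties": {
                  "theMissingNestedField": {
                    "properties": {
                      "fieldX": { "type": "keyword" },
                      "fieldY": { "type": "keyword" }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
    ```

    Once the fields appear in the mapping, the connector can include them in the inferred dataframe schema.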