Spark does not read all ORC files from different folders using mergeSchema

I have three different ORC files in three different folders, and I want to read them all into one DataFrame in one shot.

user1.orc at /data/user1/

| userid            |     name           |
|         1         |            aa      |
|         6         |            vv      |

user2.orc at /data/user2/

| userid            |     info           |
|         11        |            i1      |
|         66        |            i6      |

user3.orc at /data/user3/

| userid            |     con            |
|         12        |            888     |
|         17        |            123     |

I want to read all of these at once and get a DataFrame like the one below:

| userid            |         name       |       info         |    con   |
|             1     |         aa         |       null         |  null    |
|             6     |         vv         |       null         |  null    |
|            11     |        null        |         i1         |  null    |
|            66     |        null        |         i6         |  null    |
|            12     |        null        |       null         |  888     |
|            17     |        null        |       null         |  123     |

So I tried this:

val df = spark.read.option("mergeSchema", "true").orc("file:///home/hadoop/data/")

but it only returns the column that is common to all the files:

| userid            |
|             1     |
|             6     |
|            11     |
|            66     |
|            12     |
|            17     |

So how can I read all three files in one shot?


  • I have a very stupid workaround for you, just in case you don't find any other solution.

    Read each folder into its own DataFrame, convert the rows to JSON, and then union them, something like below:

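    // `spark` is assumed to be the active SparkSession; each folder becomes a Dataset of JSON strings
    val user1 = spark.read.orc("/home/prasadkhode/data/user1/").toJSON
    val user2 = spark.read.orc("/home/prasadkhode/data/user2/").toJSON
    val user3 = spark.read.orc("/home/prasadkhode/data/user3/").toJSON
    // Re-reading the unioned JSON lets Spark infer one schema that covers every column
    val result = spark.read.json(user1.union(user2).union(user3))
    result.printSchema()
    result.show(false)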

    and the output will be:

     |-- con: long (nullable = true)
     |-- info: string (nullable = true)
     |-- name: string (nullable = true)
     |-- userId: long (nullable = true)

    |con |info|name|userId|
    |null|null|vv  |6     |
    |null|null|aa  |1     |
    |null|i6  |null|66    |
    |null|i1  |null|11    |
    |888 |null|null|12    |
    |123 |null|null|17    |


    It looks like mergeSchema is not supported for ORC data; there is an open ticket for it in the Spark Jira.
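
If the JSON round trip above feels too heavy, another option is to pad each DataFrame with the columns it lacks and union the results directly. Below is a minimal sketch of that idea, not a tested recipe: `MergeOrcFolders`, `alignTo`, and `readAll` are names made up for illustration, and it assumes that columns sharing a name are spelled identically and have the same type in every folder.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.StructField

// Hypothetical helper object, not part of Spark.
object MergeOrcFolders {

  // Select every field in `fields`, in that order; fields missing from `df`
  // are filled with typed null columns.
  def alignTo(df: DataFrame, fields: Seq[StructField]): DataFrame = {
    val present = df.columns.toSet
    val columns = fields.map { f =>
      if (present.contains(f.name)) col(f.name)
      else lit(null).cast(f.dataType).as(f.name)
    }
    df.select(columns: _*)
  }

  // Read each folder on its own, then union the frames over the combined columns.
  // `paths` must be non-empty, otherwise `reduce` throws.
  def readAll(spark: SparkSession, paths: Seq[String]): DataFrame = {
    val frames = paths.map(p => spark.read.orc(p))
    // Keep the first occurrence of each column name.
    // Assumption: same-named columns have the same type everywhere.
    val fields = frames
      .flatMap(_.schema.fields)
      .foldLeft(Vector.empty[StructField]) { (acc, f) =>
        if (acc.exists(_.name == f.name)) acc else acc :+ f
      }
    // `union` matches columns by position, which is safe after alignment.
    frames.map(alignTo(_, fields)).reduce(_ union _)
  }
}
```

Calling `MergeOrcFolders.readAll(spark, Seq("/data/user1/", "/data/user2/", "/data/user3/"))` with the folders from the question should give one DataFrame whose columns appear in order of first occurrence, with nulls wherever a folder has no such column. On Spark 3.1 or later, `unionByName(other, allowMissingColumns = true)` should do the padding for you.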
