Search code examples
javasqloracleapache-spark-sqlapache-drill

One SQL query to access multiple data sources in Java (from oracle, excel, sql server)


I need to develop application that can be getting data from multiple data sources ( Oracle, Excel, Microsoft Sql Server, and so on) using one SQL query. For example:

 SELECT o.employeeId, count(o.orderId) 
    FROM employees@excel e. customers@microsoftsql c, orders@oracle o 
    WHERE o.employeeId = e.employeeId and o.customerId = c.customerId 
    GROUP BY o.employeeId;

This sql and data sources must be changes dynamically by java program. My customers want to write and run sql-like query from different database and storage in same time with group by, having, count, sum and so on in web interface of my aplication. Other requirements is perfomance and light-weight.

I find this way to do it (and what drawbacks I see, please, fix me if I wrong):

  1. Apache Spark (drawbacks: heavy solution, more better for BigData, slow if you need getting up-to-date informations without cached it in Spark),

  2. Distributed queries in SQL server (Database link of Oracle, Linked server of Microsoft SQL Server, Power Query of Excel) - drawbacks: problem with change data sources dynamically by java program and problem with working with Excel,

  3. Prestodb (drawbacks: heavy solution, more better for BigData),

  4. Apache Drill (drawbacks: quite young solution, some problem with not latest odbc drivers and some bugs when working),

  5. Apache Calcite (ligth framework that be used by Apache Drill, drawbacks: quite young solution yet),

  6. Do join from data sources manually (drawbacks: a lot of work to develop correct join, "group by" in result set, find best execution plan and so on)

May be, do you know any other way (using free open-source solutions) or give me any advice from your experience about ways in above? Any help would be greatly appreciated.


Solution

  • One of the appropriate solution is DataNucleus platform which has JDO, JPA and REST APIs. It has support for almost every RDBMS (PostgreSQL, MySQL, SQLServer, Oracle, DB2 etc) and NoSQL datastore like Map based, Graph based, Doc based etc, database web services, LDAP, Documents like XLS, ODF, XML etc.

    Alternatively you can use EclipseLink, which also has support for RDBMS, NoSQL, database web services and XML.

    By using JDOQL which is part of JDO API, the requirement of having one query to access multiple datastore will be met. Both the solutions are open-source, relatively lightweight and performant.

    Why did I suggest this solution ?

    • From your requirement its understood that the datastore will be your customer choice and you are not looking for a Big Data solution.
    • You are preferring open-source solutions, which are light weight and performant.
    • Considering your use case you might require a data management platform with polyglot persistence behaviour, which has the ability to leverage multiple datastore, based on your/customer's use cases.

    To read more about polyglot persistence

    https://dzone.com/articles/polyglot-persistence-future

    https://www.mapr.com/products/polyglot-persistence