database-design, architecture, aggregation-framework, middleware

Backend solution for fetching and transforming data from various third-party APIs


We are building a new feature set for one of our financial applications. We have our own SQL Server database, and we will be calling multiple RESTful APIs that return JSON responses. For example, some return news data, some return stock info, some return finance data, and our own SQL Server database holds employee data, so every source comes with its own data format. The new app we are building is going to aggregate all of that data and transform it into a meaningful display on the web, much like mint.com does.

  • The web application will display analytical reports based on this data
  • There will be an option to download reports through various templates

We are completely open in terms of technology stack for our backend and middle tier. As a first thought, a NoSQL store like MongoDB plus Elasticsearch for search and reporting comes to mind. There will be a web application built on top of this data (stored locally or retrieved from the APIs), most likely in ASP.NET MVC.

We need your input, especially if you have experience building a similar enterprise solution.

Can you please share your opinions on this?

  1. What is a good tech stack you would pick for this app?
  2. How would it scale, both now and in the future when the API data formats change?
  3. Performance is also important, since the data will be displayed in the web UI.

Solution

  • We have a similar setup to the one you are describing: ASP.NET MVC with Elasticsearch (SQL Server for the relational data, periodically pushed into ES), aggregating XML/JSON data from multiple sources, although our purpose is improving search and filtering of results rather than reporting. I would still expect the scenario you are looking at to be a suitable match for Elasticsearch, depending on your specific requirements; a rough sketch of that periodic sync is at the end of this answer.

    1) Since you are already using SQL Server (and, I expect, are familiar with it), I would suggest combining it with Elasticsearch - the additional MongoDB layer seems unnecessary, since it means maintaining yet another technology and doing extra development work to integrate it. There is a very good pair of C# clients (Elasticsearch.Net and NEST, used together) that exposes most of the ES functionality.
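
    As a minimal sketch of what that looks like (assuming NEST 7.x and an Elasticsearch node on http://localhost:9200; the StockQuote class and "quotes" index are made up for illustration), indexing and querying a POCO from C# is roughly:

        using System;
        using Nest;

        public class StockQuote
        {
            public string Ticker { get; set; }
            public decimal Price { get; set; }
            public DateTime Timestamp { get; set; }
        }

        public static class EsDemo
        {
            public static void Main()
            {
                // Connect to a local node and use "quotes" as the default index.
                var settings = new ConnectionSettings(new Uri("http://localhost:9200"))
                    .DefaultIndex("quotes");
                var client = new ElasticClient(settings);

                // Index a single document; NEST serializes the POCO to JSON for us.
                var indexResponse = client.IndexDocument(new StockQuote
                {
                    Ticker = "MSFT",
                    Price = 123.45m,
                    Timestamp = DateTime.UtcNow
                });

                // Full-text search via the fluent query DSL. Note that newly indexed
                // documents only become searchable after the index refresh (~1s by default).
                var searchResponse = client.Search<StockQuote>(s => s
                    .Query(q => q.Match(m => m.Field(f => f.Ticker).Query("MSFT"))));

                Console.WriteLine($"Indexed ok: {indexResponse.IsValid}, hits: {searchResponse.Total}");
            }
        }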

    2) We chose Elasticsearch for its scalability combined with flexibility and ease of use. One challenge you may face is mapping your C# classes to Elasticsearch documents. It is incredibly easy to set up, but you do need to do some planning to index the data the way you want to search and retrieve it. So if you choose ES as a platform, spend some time on the structure of the documents: by default, dynamic mapping is enabled, so you can throw pretty much any JSON into a document. For a production environment, however, it is better to turn that off and set up one or more explicit mappings, so the documents can be queried in a standardized way.
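
    For example, a minimal sketch of an explicit mapping with dynamic mapping turned off (again assuming NEST 7.x; the index, class, and field names are made up for illustration, and the exact method names vary slightly between NEST versions):

        using System;
        using Nest;

        public class NewsArticle
        {
            public string Title { get; set; }
            public string Ticker { get; set; }
            public DateTime PublishedAt { get; set; }
        }

        public static class CreateIndexDemo
        {
            public static void Main()
            {
                var client = new ElasticClient(
                    new ConnectionSettings(new Uri("http://localhost:9200")));

                // Explicit, non-dynamic mapping: full-text search on Title,
                // exact filtering on Ticker, range queries/sorting on PublishedAt.
                var createResponse = client.Indices.Create("news", c => c
                    .Map<NewsArticle>(m => m
                        .Dynamic(false)  // fields not declared below are not indexed
                        .Properties(p => p
                            .Text(t => t.Name(n => n.Title))
                            .Keyword(k => k.Name(n => n.Ticker))
                            .Date(d => d.Name(n => n.PublishedAt)))));

                Console.WriteLine($"Index created: {createResponse.IsValid}");
            }
        }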

    3) Performance is a key factor for us as well, which is why we looked at Lucene-based engines like Solr and Elasticsearch during our research, along with NoSQL databases. In our experience, Elasticsearch outperforms SQL Server by 10 to 1 or better in most scenarios. Solr vs. Elasticsearch performance depends on the scenario; benchmarks and comparisons are easy to find online. The exception may be when many documents must be retrieved in a single query: ES (or rather Lucene) is not built for that use case; it is best at quickly returning a small page of results (on the order of Google's results-per-page count). If you need 1000 documents per page/result, a NoSQL database may be a better option.
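
    To illustrate the paging point (hypothetical index and fields again, NEST 7.x), a typical web UI query only pulls back one small page per request:

        using System;
        using Nest;

        public class NewsArticle
        {
            public string Title { get; set; }
            public string Ticker { get; set; }
            public DateTime PublishedAt { get; set; }
        }

        public static class PagedSearchDemo
        {
            public static void Main()
            {
                var client = new ElasticClient(
                    new ConnectionSettings(new Uri("http://localhost:9200")).DefaultIndex("news"));

                // Page 3, 20 results per page: ES is optimized for returning small
                // windows like this quickly, not for pulling back thousands of
                // documents in a single query.
                var page = 3;
                var pageSize = 20;
                var response = client.Search<NewsArticle>(s => s
                    .From((page - 1) * pageSize)
                    .Size(pageSize)
                    .Query(q => q.Match(m => m.Field(f => f.Title).Query("quarterly earnings")))
                    .Sort(so => so.Descending(f => f.PublishedAt)));

                foreach (var article in response.Documents)
                    Console.WriteLine($"{article.PublishedAt:d}  {article.Title}");
            }
        }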

    Elasticsearch is fast to get up and running - install it on a local development box and try it out; you will quickly get a feel for whether it fits.
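
    For reference, the periodic SQL-Server-to-Elasticsearch sync mentioned at the top of this answer essentially boils down to reading changed rows and bulk-indexing them into ES. A minimal sketch, assuming NEST 7.x, System.Data.SqlClient, and made-up table, column, and index names (in practice you would run this from a scheduled job and track the last sync timestamp properly):

        using System;
        using System.Collections.Generic;
        using System.Data.SqlClient;
        using Nest;

        public class Employee
        {
            public int Id { get; set; }
            public string Name { get; set; }
            public string Department { get; set; }
        }

        public static class EmployeeSync
        {
            // Run from a scheduled job (e.g. every few minutes) to push
            // recently changed SQL Server rows into the "employees" index.
            public static void Main()
            {
                var employees = new List<Employee>();
                using (var conn = new SqlConnection("Server=.;Database=Hr;Integrated Security=true"))
                using (var cmd = new SqlCommand(
                    "SELECT Id, Name, Department FROM dbo.Employees WHERE ModifiedAt > @since", conn))
                {
                    cmd.Parameters.AddWithValue("@since", DateTime.UtcNow.AddMinutes(-5));
                    conn.Open();
                    using (var reader = cmd.ExecuteReader())
                    {
                        while (reader.Read())
                        {
                            employees.Add(new Employee
                            {
                                Id = reader.GetInt32(0),
                                Name = reader.GetString(1),
                                Department = reader.GetString(2)
                            });
                        }
                    }
                }

                var client = new ElasticClient(
                    new ConnectionSettings(new Uri("http://localhost:9200")).DefaultIndex("employees"));

                // Bulk-index the batch; NEST uses Employee.Id as the document _id by convention.
                var bulkResponse = client.IndexMany(employees);
                Console.WriteLine($"Synced {employees.Count} rows, errors: {bulkResponse.Errors}");
            }
        }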