I am new to data analytics and big data concepts, and I am stuck deciding which technology to use to implement my requirement.
My need is as follows:
My client uses two Oracle databases as their organization's ERP backend. These two databases have different structures and hold different types of data. I need to create a data analytics application using the data from both of them. What technology should I adopt for this implementation? Can I go with Hadoop and its associated applications?
If I go with Hadoop, how can I sync my Oracle databases to Hadoop? I am looking for a solution with real-time syncing.
Or can I use native database connections to access the data and build my new application that way? The combined size of the databases would be around 1.5 TB.
There are a lot of layers to this question, so I'll keep it somewhat general to give you a push in the right direction.
You suggest two approaches - one would keep your data in Oracle, the other would bring it to Hadoop.
If you stay in Oracle, you may need to use a DI (data integration) tool such as Informatica, Pentaho, SAS DI, or SAS Enterprise to interrogate the tables in the different schemas, extract the data you need, and call in analytics either from native steps or by integrating Python, R, or Weka scripts.
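As a rough illustration of the "integrate a Python script" part, here is a minimal sketch of an analytics step a DI tool could call after it has extracted data. The file path and column names (`extract.csv`, `region`, `amount`) are hypothetical placeholders for whatever your extraction step actually produces:

```python
# Minimal analytics script a DI tool could invoke as an external step.
# Assumes the DI tool has already written an extract file; path and
# column names here are placeholders, not part of any real schema.
import pandas as pd

def summarize(extract_path="extract.csv"):
    df = pd.read_csv(extract_path)
    # Simple aggregate as a stand-in for the real analytics logic
    return df.groupby("region")["amount"].agg(["count", "sum", "mean"])

if __name__ == "__main__":
    print(summarize())
```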
To the best of my knowledge, Hadoop doesn't natively integrate with Oracle but instead manages its own file system, HDFS. Sqoop jobs running on Hadoop can extract from Oracle and write to Hive or HBase tables, and your integration layer would then use a HiveContext on Spark, which lets you perform the analytics.
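A minimal sketch of that flow, assuming a Sqoop job has already loaded an Oracle table into Hive (the connection string, table, and column names below are placeholders, not your actual schema):

```python
# Assumes a Hive table has already been populated by a Sqoop job, e.g.:
#   sqoop import --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
#       --username erp_user --password-file /user/etl/.pw \
#       --table SALES --hive-import --hive-table erp.sales
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="erp-analytics")
hc = HiveContext(sc)

# Query the Sqoop-loaded Hive table and run a simple aggregation
sales = hc.sql("SELECT region, amount FROM erp.sales")
totals = sales.groupBy("region").sum("amount")
totals.show()
```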
You may be able to interrogate the databases directly using R or Python. Packt offered a guide at one point on Business Intelligence Using R that included chapters on the ETL (Extract-Transform-Load) process using R. I will tell you this isn't a common solution in the industry, because R is primarily an analyst's language, not an ETL developer's tool. That said, R should be able to query most Oracle databases (unless they're really old) and perform both the integration and the analytics. The downside is that R's kernel may need more processing power and threads than RStudio can provide, which is part of why Oracle SQL Developer and Toad handle large-scale queries so well. Python can probably take the same approach using the cx_Oracle library.
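For the Python route, a minimal sketch of querying the two Oracle databases directly with cx_Oracle and joining the results in pandas might look like the following. The hosts, service names, credentials, table names, and join key (`customer_id`) are all hypothetical placeholders:

```python
# Pull data from two Oracle databases and integrate it in pandas.
# All connection details and table/column names are placeholders.
import cx_Oracle
import pandas as pd

erp_a = cx_Oracle.connect(
    "user_a", "password_a",
    cx_Oracle.makedsn("db-a.example.com", 1521, service_name="ERPA"))
erp_b = cx_Oracle.connect(
    "user_b", "password_b",
    cx_Oracle.makedsn("db-b.example.com", 1521, service_name="ERPB"))

customers = pd.read_sql("SELECT customer_id, region FROM customers", erp_a)
orders = pd.read_sql("SELECT customer_id, order_total FROM orders", erp_b)

# Integrate the two sources and run a simple analytic
combined = customers.merge(orders, on="customer_id")
print(combined.groupby("region")["order_total"].mean())
```

Keep in mind that at around 1.5 TB you would not want to pull full tables into memory this way; you would push as much filtering and aggregation as possible into the SQL itself.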