We have a use case involving retail industry data and are building an EDW. We currently do reporting from HAWQ, but we want to move our MPP database from HAWQ to Greenplum. Basically, we would like to make changes to our current data pipeline.
Our points of confusion about GPDB:
How is Greenplum physically or logically going to help with SQL transformation and reporting?
Which file format should I opt for when storing files in GPDB? In HAWQ we store files in plain text format. Which of the supported formats (e.g. Avro, Parquet) are good for writing in GPDB?
How are data files processed by GPDB, so that it also brings faster reporting and predictive analysis?
Is there any way to push data from HAWQ into Greenplum? We are looking for guidance on how to shift our reporting use case from HAWQ into Greenplum.
Any help would be much appreciated.
This query is sort of like asking, "when should I use a wrench?" The answer is also going to be subjective as Greenplum can be used for many different things. But, I will do my best to give my opinion because you asked.
How is the GPDB layer going to affect our existing data pipeline? The current pipeline is external system --> Talend --> Hadoop/HAWQ --> Tableau. We want to transform it into external system --> Talend --> Hadoop/HAWQ --> Greenplum --> Tableau.
There are lots of ways to build the data pipeline. Your goal of loading data into Hadoop first and then loading it into Greenplum is very common and works well. You can use external tables in Greenplum to read data in parallel, directly from HDFS, so the data movement from the Hadoop cluster to Greenplum can be achieved with a simple INSERT statement.
INSERT INTO greenplum_customer SELECT * FROM hdfs_customer_file;
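For that INSERT to work, hdfs_customer_file must first be defined as a readable external table pointing at HDFS. A minimal sketch, assuming the gphdfs protocol and made-up namenode host, path, and columns (greenplum_customer is a regular Greenplum table with the same column list):

```sql
-- Hypothetical definition of hdfs_customer_file: a readable external
-- table that maps plain-text files in HDFS into Greenplum.
-- The namenode host/port, path, delimiter, and columns are assumptions.
CREATE EXTERNAL TABLE hdfs_customer_file (
    customer_id  bigint,
    name         text,
    signup_date  date
)
LOCATION ('gphdfs://namenode:8020/data/retail/customer/*.txt')
FORMAT 'TEXT' (DELIMITER '|');
```

Every segment reads its share of the HDFS files in parallel, which is what makes the single-INSERT load fast.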
How is Greenplum physically or logically going to help with SQL transformation and reporting?
Isolation, for one. With a separate cluster for Greenplum, you can provide analytics to your customers without impacting the performance of your Hadoop activity, and vice versa. This isolation can also provide an additional security layer.
Which file format should I opt for when storing files in GPDB? In HAWQ we store files in plain text format. Which of the supported formats (e.g. Avro, Parquet) are good for writing in GPDB?
With your data pipeline as you suggested, I would base the storage format decision in Greenplum on performance. For large tables, partition them and make them column-oriented with quicklz compression. For smaller tables, just make them append-optimized. And for tables that have lots of updates or deletes, keep the default heap storage.
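The three cases above can be sketched as DDL; all table and column names here are made up for illustration:

```sql
-- Large fact table: append-optimized, column-oriented, quicklz
-- compressed, and range-partitioned by month.
CREATE TABLE sales_fact (
    sale_id     bigint,
    customer_id bigint,
    sale_date   date,
    amount      numeric(12,2)
)
WITH (appendonly=true, orientation=column, compresstype=quicklz)
DISTRIBUTED BY (sale_id)
PARTITION BY RANGE (sale_date)
    (START (date '2016-01-01') INCLUSIVE
     END   (date '2017-01-01') EXCLUSIVE
     EVERY (INTERVAL '1 month'));

-- Smaller table: plain append-optimized, row-oriented.
CREATE TABLE store_dim (
    store_id bigint,
    region   text
)
WITH (appendonly=true)
DISTRIBUTED BY (store_id);

-- Frequently updated/deleted table: leave it as the default heap.
CREATE TABLE inventory (
    sku     bigint,
    on_hand integer
)
DISTRIBUTED BY (sku);
```

The DISTRIBUTED BY clause controls which segment each row lands on; pick a high-cardinality key so rows spread evenly.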
How are data files processed by GPDB, so that it also brings faster reporting and predictive analysis?
Greenplum is an MPP database. The storage is "shared nothing" meaning that each node has unique data that no other node has (excluding mirroring for high-availability). A segment's data will always be on the local disk.
In HAWQ, because it uses HDFS, the data for a segment doesn't have to be local. On day 1, when you wrote the data to HDFS, it was local, but after node failures, expansion, etc., HAWQ may have to fetch the data from other nodes. This is how Hadoop works, and it makes Greenplum's performance a bit more predictable than HAWQ's.
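You can see how a table's rows are spread across Greenplum's segments with the gp_segment_id system column; "customer" here is a placeholder table name:

```sql
-- gp_segment_id is a system column on every Greenplum table; grouping
-- by it shows the row count stored on each segment, which should be
-- roughly even for a well-chosen distribution key.
SELECT gp_segment_id, count(*) AS rows_on_segment
FROM   customer
GROUP  BY gp_segment_id
ORDER  BY gp_segment_id;
```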
Is there any way to push data from HAWQ into Greenplum? We are looking for guidance on how to shift our reporting use case from HAWQ into Greenplum.
Push, no; but pull, yes. As I mentioned above, you can create an external table in Greenplum to SELECT data from HDFS. You can also create writable external tables in Greenplum to push data to HDFS.
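A minimal sketch of the writable direction, again assuming the gphdfs protocol with made-up host, path, and columns:

```sql
-- Hypothetical writable external table that writes rows back out to
-- HDFS as delimited text; host, port, path, and columns are assumptions.
CREATE WRITABLE EXTERNAL TABLE hdfs_customer_export (
    customer_id bigint,
    name        text
)
LOCATION ('gphdfs://namenode:8020/export/customer')
FORMAT 'TEXT' (DELIMITER '|')
DISTRIBUTED BY (customer_id);

-- Writable external tables are INSERT-only; each segment writes its
-- own file under the target directory in parallel.
INSERT INTO hdfs_customer_export
SELECT customer_id, name FROM customer;
```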