I am trying to build a big data platform to receive and store large amounts of heterogeneous data (documents, videos, images, sensor data, etc.) in Hadoop, and then implement a classification process.
What architecture would help? I am currently using:
VMware vSphere ESXi
Hadoop
HBase
Thrift
XAMPP
All of these are working fine, but I don't know how to receive and store large amounts of data, because I have discovered that HBase is a column-oriented database, not a data warehouse.
You have to tailor the solution to the type of big data (structured, semi-structured, and unstructured).
You can use Hive/HBase for structured data if the total data size is <= 10 TB.
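As a rough sketch of what querying structured data in Hive looks like from Java over JDBC (the HiveServer2 address, credentials, and the web_logs table are assumptions for the example):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
        public static void main(String[] args) throws Exception {
            // Register the HiveServer2 JDBC driver.
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // Assumes HiveServer2 on localhost:10000 and a hypothetical
            // "web_logs" table in the default database.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT status, COUNT(*) FROM web_logs GROUP BY status")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }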
You can use Sqoop to import structured data from a traditional RDBMS such as Oracle or SQL Server.
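A minimal sketch of a Sqoop import, here invoked programmatically through Sqoop's Java entry point; the connection URL, credentials, table name, and target directory are placeholders, and the same arguments work on the sqoop command line:

    import org.apache.sqoop.Sqoop;

    public class SqoopImportExample {
        public static void main(String[] args) {
            // Equivalent to running "sqoop import ..." on the command line.
            // Connection URL, credentials, and table name are placeholders.
            String[] sqoopArgs = {
                "import",
                "--connect", "jdbc:oracle:thin:@//dbhost:1521/ORCL",
                "--username", "scott",
                "--password", "tiger",
                "--table", "CUSTOMERS",
                "--target-dir", "/data/structured/customers",
                "--num-mappers", "4"
            };
            int exitCode = Sqoop.runTool(sqoopArgs);
            System.exit(exitCode);
        }
    }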
You can use Flume for ingesting unstructured data.
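For context, applications typically push events into a Flume agent through its RPC client API; a rough sketch, assuming an Avro source listening on localhost:41414 (the sample event body is made up):

    import java.nio.charset.StandardCharsets;

    import org.apache.flume.Event;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    public class FlumeClientExample {
        public static void main(String[] args) throws Exception {
            // Assumes a Flume agent with an Avro source on localhost:41414.
            RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
            try {
                // Wrap one raw record (e.g. a log line or sensor reading)
                // in a Flume event and send it to the agent.
                Event event = EventBuilder.withBody(
                    "sensor-42,2015-06-01T12:00:00Z,23.5",
                    StandardCharsets.UTF_8);
                client.append(event);
            } finally {
                client.close();
            }
        }
    }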
You can use a Content Management System (CMS) to handle unstructured and semi-structured data at terabyte or petabyte scale. If you are storing unstructured data, I prefer to keep the content itself in the CMS and store the metadata in a NoSQL database like HBase.
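To illustrate the metadata idea, here is a sketch using the HBase Java client; the doc_metadata table, the meta column family, and the CMS URI scheme are assumptions for the example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MetadataWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("doc_metadata"))) {
                // Row key: document id. The content itself lives in the CMS;
                // HBase holds only descriptive metadata plus a pointer to it.
                Put put = new Put(Bytes.toBytes("doc-00042"));
                put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("type"),
                              Bytes.toBytes("video/mp4"));
                put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("cms_uri"),
                              Bytes.toBytes("cms://repository/videos/doc-00042"));
                table.put(put);
            }
        }
    }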
To process big data streams, you can use Pig.
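A small sketch of running a Pig dataflow from Java via PigServer; the input path and field layout are made up for the example:

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigJobExample {
        public static void main(String[] args) throws Exception {
            // Runs the Pig Latin statements as a MapReduce job on the cluster.
            PigServer pig = new PigServer(ExecType.MAPREDUCE);
            try {
                // Input path and field layout are placeholders for this sketch.
                pig.registerQuery(
                    "logs = LOAD '/data/raw/sensor_logs' USING PigStorage(',') "
                    + "AS (sensor:chararray, ts:chararray, value:double);");
                pig.registerQuery("by_sensor = GROUP logs BY sensor;");
                pig.registerQuery(
                    "avg_value = FOREACH by_sensor GENERATE group, AVG(logs.value);");
                pig.store("avg_value", "/data/processed/sensor_averages");
            } finally {
                pig.shutdown();
            }
        }
    }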
Have a look at how structured and unstructured data are handled in Hadoop.