database architecture azure-sql-database amazon-rds etl

Which database to choose to store data coming from flat files (CSV, HTML)


I need to design a scalable database architecture to store the data that currently arrives as flat files (CSV, HTML, etc.). These files are produced from Elasticsearch, and most of the scripts are written in Python. The architecture should automate most of the daily manual processing now done with Excel, CSV, and HTML files, and all data should be retrieved from this database instead of being populated into CSV and HTML files.

Database requirements:

  1. The database must perform well for day-to-day data retrieval and will be queried by multiple teams.
  2. An ER model and schema will be developed for the data, with logical relationships.
  3. The database can be hosted in the cloud.
  4. The database must be highly available and must return data quickly.
  5. The database will be used to build multiple dashboards.
  6. ETL jobs will be responsible for loading data into the database.
  7. There will be many reads and multiple writes each day, with large volumes of data coming from Elasticsearch and some cloud tools.

I am considering RDS, Azure SQL, DynamoDB, Postgres, or Google Cloud. Which database engine would best fit these requirements? I would also like to know how the ETL process should be designed: with a lambda or a kappa architecture?


Solution

  • To store relational data such as CSV and Excel files, you can use a relational database. Flat files such as HTML, which do not need to be queried, can simply go into a storage account with any cloud service provider, for example Azure.

    Azure SQL Database is a fully managed platform as a service (PaaS) database engine that handles most database management functions, such as upgrading, patching, backups, and monitoring, without user involvement. It always runs on the latest stable version of the SQL Server database engine and a patched OS, with 99.99% availability. You can restore the database to any point in time within the backup retention period. This should make it a good choice for storing relational data and running SQL queries.

    Azure Blob Storage is Microsoft's object storage solution for the cloud. Blob storage is optimized for storing massive amounts of unstructured data. Your HTML files can be stored here.

    The ETL jobs can be run with Azure Data Factory (ADF). It can connect to almost any data source (including sources outside Azure), transform the stored dataset, and write it to the desired destination. Data flow transformations in ADF can perform the ETL-related tasks.
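
Since most of your scripts are already in Python, the extract, transform, and load steps can be prototyped locally before moving them into ADF. The following is only a minimal sketch under stated assumptions: it uses the standard-library `sqlite3` module as a stand-in for the relational database (it is not Azure SQL), and the `metrics` table, its columns, and the sample CSV are hypothetical names invented for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for one daily CSV export.
SAMPLE_CSV = """event_date,team,metric,value
2024-01-01,search,latency_ms,120
2024-01-01,search,latency_ms,80
2024-01-02,infra,latency_ms,200
"""


def extract(csv_text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Cast the value column to float and drop rows with a missing value."""
    return [
        (r["event_date"], r["team"], r["metric"], float(r["value"]))
        for r in rows
        if r.get("value")
    ]


def load(conn, records):
    """Create the (hypothetical) metrics table if needed and bulk-insert."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS metrics (
               event_date TEXT NOT NULL,
               team       TEXT NOT NULL,
               metric     TEXT NOT NULL,
               value      REAL NOT NULL
           )"""
    )
    conn.executemany("INSERT INTO metrics VALUES (?, ?, ?, ?)", records)
    conn.commit()


def daily_average(conn):
    """Example of a query a dashboard might run against the loaded data."""
    return conn.execute(
        "SELECT event_date, AVG(value) FROM metrics "
        "GROUP BY event_date ORDER BY event_date"
    ).fetchall()


if __name__ == "__main__":
    # In-memory SQLite database; a real job would connect to the cloud database.
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract(SAMPLE_CSV)))
    print(daily_average(conn))
```

Keeping the three steps as separate functions makes it easier to later swap the load step for a connection to the managed database, or to hand the same logic to a scheduled ETL tool, without touching the parsing and cleaning code.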