Tags: architecture, distributed-system, system-design, distributed-database

How to design a distributed write-heavy data store


This is an interview question that I have been thinking about for two months, and I still can't find a suitable architecture.

The problem

We want to build a small analytics system for fraud detection on orders.

The system has the following requirements:

  • No off-the-shelf technology may be used (MySQL, Redis, Hadoop, S3, etc.)
  • Needs to scale as the data volume grows
  • Just a bunch of machines, with disks and a decent amount of memory
  • 10M writes/day
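
To put the write requirement in perspective, here is a rough back-of-the-envelope calculation. It assumes writes are spread evenly over the day (real traffic will have peaks) and uses the 1-10 KB order size given in the API description; `write_load` is just an illustrative helper name.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def write_load(writes_per_day, min_kb, max_kb):
    """Return (average writes/s, min GB/day, max GB/day), decimal units."""
    writes_per_sec = writes_per_day / SECONDS_PER_DAY
    min_gb = writes_per_day * min_kb / 1_000_000  # KB -> GB
    max_gb = writes_per_day * max_kb / 1_000_000
    return writes_per_sec, min_gb, max_gb


rate, low, high = write_load(10_000_000, 1, 10)
print(f"{rate:.0f} writes/s on average, {low:.0f}-{high:.0f} GB/day of raw order data")
# prints "116 writes/s on average, 10-100 GB/day of raw order data"
```

So the average write rate is modest for a single machine; it is the accumulated volume (tens of GB per day, before replication) that forces the data to be spread over several machines as time goes on.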

The system needs to provide the following API:

  • /insertOrder(order): Order
    Add an order to the storage. An order can be treated as a blob of 1-10 KB, with orderId, beginTime, and finishTime as distinguished fields
  • /getLongestNOrdersByDuration(n: int, startTime: datetime, endTime: datetime): Order[]
    Retrieve the longest N orders that started between startTime and endTime,
    as measured by duration finishTime - beginTime
  • /getShortestNOrdersByDuration(n: int, startTime: datetime, endTime: datetime): Order[]
    Retrieve the shortest N orders that started between startTime and endTime,
    as measured by duration finishTime - beginTime
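
To pin down the intended semantics of the three calls, here is a minimal single-process reference sketch. It is not a distributed design, only an executable restatement of the API. The names `Order` and `OrderStore`, the use of epoch seconds for times, and the inclusive range bounds are assumptions made for illustration.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    order_id: str
    begin_time: float   # epoch seconds (assumption)
    finish_time: float  # epoch seconds (assumption)
    payload: bytes = b""  # the opaque 1-10 KB blob

    @property
    def duration(self):
        return self.finish_time - self.begin_time


class OrderStore:
    """Naive in-memory store: O(1) insert, full scan on every query."""

    def __init__(self):
        self._orders = []

    def insert_order(self, order):
        self._orders.append(order)
        return order

    def _started_between(self, start_time, end_time):
        # Inclusive bounds on beginTime (assumption).
        return (o for o in self._orders
                if start_time <= o.begin_time <= end_time)

    def get_longest_n_orders_by_duration(self, n, start_time, end_time):
        return heapq.nlargest(n, self._started_between(start_time, end_time),
                              key=lambda o: o.duration)

    def get_shortest_n_orders_by_duration(self, n, start_time, end_time):
        return heapq.nsmallest(n, self._started_between(start_time, end_time),
                               key=lambda o: o.duration)
```

Usage: after `store.insert_order(Order("a", 0, 10))`, a call such as `store.get_longest_n_orders_by_duration(1, 0, 60)` returns a list containing that order. Any real design has to return the same results while avoiding the full scan.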

Solution

  • Consider the Apache Druid database if you have time-series data:

    • It should scale well as the volume of data grows
    • Queries over a time range can be answered efficiently

    https://druid.apache.org/ - Druid has been used as an analytics database at scale in Fortune 500 companies
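
Time-range queries can be answered efficiently in a store like this largely because data is partitioned by time, so a query only touches the partitions that overlap its range. The sketch below illustrates that general idea for the top-N-by-duration queries in the question. It is not Druid's implementation; the class name `TimePartitionedIndex`, the one-hour bucket width, and the in-memory dictionaries standing in for separate machines are all assumptions.

```python
import heapq
from collections import defaultdict

BUCKET_SECONDS = 3600  # assumed partition width: one hour of beginTime


class TimePartitionedIndex:
    """Buckets (duration, begin_time, order_id) rows by hour of beginTime."""

    def __init__(self):
        # In a real deployment each bucket would live on some machine.
        self._buckets = defaultdict(list)

    def insert(self, order_id, begin_time, finish_time):
        bucket = int(begin_time // BUCKET_SECONDS)
        row = (finish_time - begin_time, begin_time, order_id)
        self._buckets[bucket].append(row)

    def _local_top(self, bucket, n, start, end, longest):
        # Each partition returns at most n candidates from its own rows.
        rows = (r for r in self._buckets.get(bucket, ())
                if start <= r[1] <= end)
        pick = heapq.nlargest if longest else heapq.nsmallest
        return pick(n, rows)

    def top_n(self, n, start, end, longest=True):
        # Scatter: ask only the buckets that overlap [start, end].
        first = int(start // BUCKET_SECONDS)
        last = int(end // BUCKET_SECONDS)
        partials = []
        for bucket in range(first, last + 1):
            partials.extend(self._local_top(bucket, n, start, end, longest))
        # Gather: the global top n is contained in the union of local top n.
        pick = heapq.nlargest if longest else heapq.nsmallest
        return [order_id for _, _, order_id in pick(n, partials)]
```

Because each partition returns at most n rows, the coordinator merges at most n rows per touched bucket regardless of how many orders each bucket holds. Keeping each bucket sorted by duration would remove the per-bucket scan for buckets that lie fully inside the query range.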