
How can I build a web service that loads users' daily listening history from Spotify and presents interesting analytics?


I have a project idea that involves using the Spotify API to load users' daily listening history and emailing each user a weekly "wrapped" summary, and I need some advice on the architecture. My plan is to use Airflow to run daily tasks that load the data into Postgres, and then a weekly task that generates the wrapped report. This works for a single user, but how can I make it scalable? Or is there another way to build this service that I'm missing?


Solution

  • Let's assume you have a list of users and you want to fetch their data from the Spotify API.

    This works for a single user, but how can I make it scalable?

    By scalable, I understand that you want to fetch the data efficiently. Here are some options:

    • Load the users' data in a loop, one user after another: slow if you have a large number of users -> not scalable, but easy
    • Spawn a Kubernetes pod from Airflow that uses multiprocessing to process the data in parallel -> scalable, but difficult
    • Create a Spark job that queries the API in parallel and trigger it from Airflow -> very scalable and not difficult
    • Use Dynamic Task Mapping to map each user to a separate Airflow task that fetches and processes their data -> as scalable as your Airflow deployment, and easy
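    To make the trade-off between the sequential and parallel options concrete, here is a minimal sketch in plain Python. `fetch_history` is a hypothetical stand-in for the real Spotify API call; with a thread pool, N slow HTTP requests overlap instead of queueing one after another:

    ```python
    from concurrent.futures import ThreadPoolExecutor

    def fetch_history(user_id: str) -> dict:
        # Hypothetical stand-in for calling the Spotify API for one user.
        return {"user": user_id, "tracks": [f"track-{user_id}-{i}" for i in range(3)]}

    def fetch_all_sequential(user_ids: list[str]) -> list[dict]:
        # Option 1: loop user after user -- total time grows linearly
        # with the number of users and the API latency.
        return [fetch_history(u) for u in user_ids]

    def fetch_all_parallel(user_ids: list[str], max_workers: int = 8) -> list[dict]:
        # Same work fanned out over a thread pool, so up to max_workers
        # requests are in flight at once. This is the idea behind the
        # multiprocessing-pod and Dynamic Task Mapping options, where the
        # fan-out is done by Kubernetes or the Airflow scheduler instead.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(fetch_history, user_ids))
    ```

    Both functions return the same results in the same order; only the wall-clock time differs once `fetch_history` does real network I/O.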

    Whichever method you choose, you can store the raw data in remote storage, then process it in one scalable task (Spark or Trino, for example) to create the reports, and send them in a third task.
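    The fetch/store -> process -> send split above can be sketched as three plain functions. This is only an illustration of the data flow: local JSON files stand in for remote storage, and `sender` is a hypothetical callback where a real email client would go; in a real deployment each function would be its own Airflow task.

    ```python
    import json
    from pathlib import Path
    from typing import Callable

    def store_raw(storage_dir: Path, user_id: str, plays: list[str]) -> Path:
        # Task 1: each fetch task dumps its user's raw history to storage.
        path = storage_dir / f"{user_id}.json"
        path.write_text(json.dumps({"user": user_id, "plays": plays}))
        return path

    def build_reports(storage_dir: Path) -> dict:
        # Task 2: one aggregation task reads everything back and computes
        # the weekly "wrapped" stats per user.
        reports = {}
        for f in sorted(storage_dir.glob("*.json")):
            data = json.loads(f.read_text())
            plays = data["plays"]
            reports[data["user"]] = {
                "total_plays": len(plays),
                "top_track": max(set(plays), key=plays.count) if plays else None,
            }
        return reports

    def send_reports(reports: dict, sender: Callable[[str, dict], None]) -> None:
        # Task 3: deliver each report (sender is a hypothetical email hook).
        for user, report in reports.items():
            sender(user, report)
    ```

    Keeping the stages separate like this is what lets you scale each one independently: the fan-out happens only in stage 1, while stages 2 and 3 run once per week over the accumulated data.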