sql, api, rest, snowflake-cloud-data-platform, data-warehouse

How to get segmentation data from Snowflake Table in API efficiently and cost-effectively?


I am working on a segmentation project for my company. We have to build a pipeline that gathers data from our app users, and when a user fits a segment, the app receives that information and does something with it (not in my scope). Currently, the client connects and authenticates to an endpoint that lets it send JSON event data (app started, level completed, etc.) to an Elasticsearch cluster. An Azure Function then grabs the live data every 5 minutes and stores it in Azure Blob Storage, which in turn produces queue messages that Snowflake reads to ingest the JSON files (a rough sketch of the collection function follows the tables below). We then use Snowflake to run a task per segment (the segments will be decided by the analysts or executives), and the output is written to a table like the one below:

AccountID  Game    SegmentID  CreatedAt   DeletedAt
123456789  Game 1  1          2021-04-20  2021-04-21
123456789  Game 1  2          2021-04-20
123456789  Game 1  3          2021-04-20

Where SegmentID can represent something like

SegmentID  SegmentType    SegmentDescription
1          5 Day Streak   User played for 5 consecutive days
2          10 Day Streak  User played for 10 consecutive days
3          15 Day Streak  User played for 15 consecutive days
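
The collection step is roughly the sketch below: a timer-triggered Azure Function (Python) that pulls the last five minutes of events from Elasticsearch and drops them into Blob Storage as one JSON file. The index, container and setting names here are placeholders rather than our real ones.

# Hypothetical timer-triggered Azure Function: pull the last 5 minutes of
# events from Elasticsearch and write them to Blob Storage as one file.
# Index, container and environment-variable names are placeholders.
import json
import os
from datetime import datetime, timezone

import azure.functions as func
from azure.storage.blob import BlobServiceClient
from elasticsearch import Elasticsearch


def main(mytimer: func.TimerRequest) -> None:
    es = Elasticsearch(os.environ["ES_HOST"])

    # Grab events ingested in the last 5 minutes.
    result = es.search(
        index="app-events",
        body={"query": {"range": {"@timestamp": {"gte": "now-5m"}}}},
        size=10000,
    )
    docs = [hit["_source"] for hit in result["hits"]["hits"]]
    if not docs:
        return

    # Write one newline-delimited JSON blob; Snowflake then ingests it via
    # the queue notification on the container.
    blob_service = BlobServiceClient.from_connection_string(
        os.environ["BLOB_CONNECTION_STRING"]
    )
    blob_name = f"events/{datetime.now(timezone.utc):%Y%m%d%H%M%S}.json"
    blob_client = blob_service.get_blob_client("raw-events", blob_name)
    blob_client.upload_blob("\n".join(json.dumps(d) for d in docs))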

In the next step of the pipeline, when the game boots up, the app should post a request through the same API it authenticated with to grab all the segments the user matches. The dev team will then decide where, when in the session, and how to use that information to personalize content. Something like:

select
  SegmentID
from
  SegmentTable
where
  AccountID='{AccountID the App authenticated with}' and
  Game='{Game the App authenticated with}' and
  DeletedAt is null

Response:

SegmentID
2
3

Serialised:

{"SegmentID": [2,3]}

We expect about 300K-500K users per day. My question is: what would be the most efficient and cost-effective way to get this information from Snowflake back to the client, so that this many users can hit the same endpoint without problems and without it becoming costly?


Solution

  • OK, so a bit of a workaround, but I created an external function on Snowflake (using Azure Functions) that upserts data into a local MongoDB cluster. The API then connects to the MongoDB instance, which can handle the large volume of concurrent connections, and since it sits on a local server it is quite cheap. The only costs are the data transfer from Snowflake to MongoDB, the App Service Plan for the Azure Functions (I could not use the consumption plan because sending data to our internal server required a VNET, a NAT Gateway and a static outbound IP address in Azure), and the API Management service I had to create in Azure.

    So how does it work? At the end of each stored procedure in Snowflake, I collect the segments that have changed (a new row, or DELETED_AT no longer null) and trigger the external function, which upserts the data into MongoDB using the pymongo client, roughly as sketched below.
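
    A rough sketch of the write side: the HTTP-triggered Azure Function behind the Snowflake external function. Snowflake's external-function protocol posts a JSON body of the form {"data": [[row_number, col1, col2, ...], ...]} and expects {"data": [[row_number, result], ...]} back; the column order and the Mongo database/collection names below are placeholders of my own.

# HTTP-triggered Azure Function behind the Snowflake external function.
# The column layout (account_id, game, segment_id, deleted_at) and the
# Mongo database/collection names are placeholders.
import json
import os

import azure.functions as func
from pymongo import MongoClient

# Reuse the client across invocations handled by the same function host.
client = MongoClient(os.environ["MONGO_URI"])
collection = client["segmentation"]["user_segments"]


def main(req: func.HttpRequest) -> func.HttpResponse:
    rows = req.get_json()["data"]
    results = []
    for row_number, account_id, game, segment_id, deleted_at in rows:
        # One document per (account, game, segment); a non-null DeletedAt
        # means the user no longer matches the segment.
        collection.update_one(
            {"AccountID": account_id, "Game": game, "SegmentID": segment_id},
            {"$set": {"DeletedAt": deleted_at}},
            upsert=True,
        )
        results.append([row_number, "ok"])
    return func.HttpResponse(
        json.dumps({"data": results}), mimetype="application/json"
    )

    On the read side, the game-boot endpoint now queries MongoDB instead of Snowflake. A minimal sketch with Flask and pymongo (the route and field names are illustrative):

# Read path for the game-boot request: return the user's active segments
# straight from MongoDB.  Route, URI and names are illustrative only.
import os

from flask import Flask, jsonify
from pymongo import MongoClient

app = Flask(__name__)
segments = MongoClient(os.environ["MONGO_URI"])["segmentation"]["user_segments"]


@app.route("/segments/<game>/<account_id>")
def get_segments(game: str, account_id: str):
    cursor = segments.find(
        {"AccountID": account_id, "Game": game, "DeletedAt": None},
        {"SegmentID": 1, "_id": 0},
    )
    # Same shape as before, e.g. {"SegmentID": [2, 3]}
    return jsonify({"SegmentID": [doc["SegmentID"] for doc in cursor]})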