We're using postgres as a data warehouse, which we manage with dbt.
In it, there is a very large and thin table like this (~10 million rows):
col1 | col2 |
---|---|
'value1' | 123 |
... | ... |
'valueN' | 123 |
Also, there is a data enrichment API that accepts the existing columns as parameters and returns the relevant enriched data:
https://api.myapi.com/&q=value1+123
Each API call returns in ~1sec
What is/are performant way/s to enrich the data without leaving the data warehouse such that I can add a third column like so:
col1 | col2 | enrichedCol |
---|---|---|
'value1' | 123 | enriched1 |
... | ... | ... |
'valueN' | 123 | enrichedN |
Ideally, this is done with SQL or a stored procedure.
Snowflake has External Functions that achieve something similar.
We've considered an asyncio in a plpython3u stored procedure approach, but I feel that postgres should have a better solution.
An explanation of why doing this is an anti-pattern is also acceptable.
Postgres and dbt
guy here, looks like so far this would have been better received on the discourse than the stack-overflow but imo, great question.
Basically, the way I'm interpreting this so far, you are asking about patterns and to be fair, the vast majority of dbt users so far have only used 1, maybe 2 of the architectural "pattern" from what I've seen here. That will change over time as I believe Spark, Firebolt, and the "next-gen" datastores will likely support features like "api to db pipelines as sql" very soon.
I can think of at least 3 options here:
This would be how I roughly illustrate these patterns for documentation or a project proposal:
To describe those a little further:
Example of this is using a directory structure on a cloud storage service like in the following write-up: Loading and transforming data into BigQuery using dbt
** edit ** (Additional point for clarity, this is NOT the same as a source read-only ELT service like Fivetran or Airbyte. Here I'm referring to the pattern of using a service that is capable of doing a read/update or read/upsert operation for the circumstances where that is required like watching an Address table in a DB and then enriching that information with a sanitized version of the address from Google Places API or something equivalent).
Happy to expand further but would generally recommend re-opening this question over in the [postgres][rest][webservices]
tags since dbt is mostly irrelevant to the question so far.