Tags: sql, postgresql, dbt

Enriching data in a Postgres table from a REST API in an OLAP use case


We're using postgres as a data warehouse, which we manage with dbt.

In it, there is a very large and thin table like this (~10 million rows):

col1       col2
'value1'   123
...        ...
'valueN'   123

Also, there is a data enrichment API that accepts the existing columns as parameters and returns the relevant enriched data:

https://api.myapi.com/&q=value1+123

Each API call returns in ~1 second.

What are performant ways to enrich the data without leaving the data warehouse, such that I can add a third column like so:

col1       col2   enrichedCol
'value1'   123    enriched1
...        ...    ...
'valueN'   123    enrichedN

Ideally, this is done with SQL or a stored procedure.

Snowflake has External Functions that achieve something similar.

We've considered an asyncio-based plpython3u stored procedure approach (a rough sketch of what we had in mind is below), but I feel that Postgres should have a better solution.
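
Roughly, this is the kind of thing we had in mind; a minimal sketch only, where the batch size is arbitrary, the table name the_large_table is a placeholder, and aiohttp is assumed to be installed in the server's Python environment:

```sql
-- Sketch of the asyncio-in-plpython3u idea (assumes plpython3u is installed
-- and aiohttp is available to the server's Python; URL shape mirrors the
-- example above).
CREATE EXTENSION IF NOT EXISTS plpython3u;

CREATE OR REPLACE FUNCTION enrich_batch(col1s text[], col2s int[])
RETURNS text[] AS $$
import asyncio
import aiohttp

async def fetch(session, c1, c2):
    async with session.get(f"https://api.myapi.com/&q={c1}+{c2}") as resp:
        return await resp.text()

async def main():
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, c1, c2)
                                      for c1, c2 in zip(col1s, col2s)))

return asyncio.run(main())
$$ LANGUAGE plpython3u;

-- Called in batches, e.g.:
-- SELECT enrich_batch(array_agg(col1), array_agg(col2))
-- FROM (SELECT col1, col2 FROM the_large_table LIMIT 1000) batch;
```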

An explanation of why doing this is an anti-pattern is also acceptable.


Solution

  • Postgres and dbt guy here. So far this looks like it would have been better received on the dbt Discourse than on Stack Overflow, but IMO it's a great question.

    Basically, the way I'm interpreting this, you are asking about patterns, and to be fair, the vast majority of dbt users I've seen so far have only used one, maybe two, of these architectural "patterns". That will change over time, as I believe Spark, Firebolt, and the "next-gen" datastores will likely support features like "API-to-DB pipelines as SQL" very soon.

    I can think of at least 3 options here:

    1. Dump to "data lake"
    2. ETL / ELT Engine
    3. Polling Service (as you suggested)

    This would be how I roughly illustrate these patterns for documentation or a project proposal:

    [diagram illustrating the three patterns: dump to data lake, ETL/ELT engine, and polling service]

    To describe those a little further:

    1. Dump everything from the API in bulk to a file (or, even better, a snapshotted file, in case some data is later redacted), and then make that data available via dbt-external-tables as a view, so that you are always merging the latest data from the archive at merge-view query time. (A minimal dbt model sketch is given after this list.)

    An example of this is using a directory structure on a cloud storage service, as in the following write-up: Loading and transforming data into BigQuery using dbt.

    2. Use a service external to the database to poll the API with data from the database and write the results back. (Probably the slowest option overall, since it incurs triple network I/O: read from the DB, read from the API, then write back to the DB. A sketch of the read/upsert such a service would run is given after this list.)

    ** edit ** (An additional point for clarity: this is NOT the same as a read-only source ELT service like Fivetran or Airbyte. Here I'm referring to the pattern of using a service that can perform a read/update or read/upsert operation where that is required, e.g. watching an Address table in the DB and then enriching it with a sanitized version of the address from the Google Places API or something equivalent.)

    3. Create a procedure (à la this SO Postgres question) which, combined with a trigger, could return a value from an external service via a standard cURL call; a rough sketch follows below. (My view is that the problem with this will be maintainability: how will you be alerted if the API changes and returns unexpected values, how do you log timing and success of the procedure calls, does your DE have control over the DB as well as the dbt code base and can they perfectly anticipate how dbt changes will affect this proc, etc.)
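
    For pattern 1, here is a minimal sketch of what the merge view could look like as a dbt model, assuming the API dumps have already been declared as a source via dbt-external-tables (all source, table, and column names here are illustrative):

    ```sql
    -- models/enriched_big_table.sql  (illustrative names throughout)
    -- Merge view: always joins the warehouse table against the latest
    -- snapshot of the bulk API dump exposed via dbt-external-tables.
    with latest_dump as (

        select
            col1,
            col2,
            enriched_col,
            row_number() over (
                partition by col1, col2
                order by snapshot_at desc
            ) as rn
        from {{ source('api_archive', 'enrichment_dump') }}

    )

    select
        t.col1,
        t.col2,
        d.enriched_col
    from {{ source('warehouse', 'big_table') }} t
    left join latest_dump d
        on  d.col1 = t.col1
        and d.col2 = t.col2
        and d.rn = 1
    ```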
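
    For pattern 2, the enrichment logic lives outside the database; from Postgres's point of view the service just reads a batch of not-yet-enriched rows and upserts the results back, along these lines (the enrichment table, source table name, and batch size are all illustrative):

    ```sql
    -- What a polling service would run on each cycle (illustrative names).
    create table if not exists enrichment (
        col1         text not null,
        col2         int  not null,
        enriched_col text,
        primary key (col1, col2)
    );

    -- 1) read a batch of rows that still need enriching
    select t.col1, t.col2
    from big_table t
    left join enrichment e using (col1, col2)
    where e.col1 is null
    limit 1000;

    -- 2) the service calls the API per row, then writes the results back
    insert into enrichment (col1, col2, enriched_col)
    values ('value1', 123, 'enriched1')
    on conflict (col1, col2) do update
        set enriched_col = excluded.enriched_col;
    ```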
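
    For pattern 3, here is a rough sketch of a trigger-backed approach; I'm using plpython3u with the requests package instead of the curl-from-PL/sh call in the linked question, the table name big_table is a placeholder, and the URL shape just mirrors the example above:

    ```sql
    -- Rough sketch: enrich each row as it is inserted (assumes plpython3u and
    -- the requests package are available on the server).
    CREATE OR REPLACE FUNCTION enrich_row() RETURNS trigger AS $$
    import requests

    url = "https://api.myapi.com/&q={}+{}".format(TD["new"]["col1"], TD["new"]["col2"])
    try:
        resp = requests.get(url, timeout=5)
        TD["new"]["enriched_col"] = resp.text if resp.ok else None
    except requests.RequestException:
        TD["new"]["enriched_col"] = None  # don't block the insert on API failure

    return "MODIFY"
    $$ LANGUAGE plpython3u;

    CREATE TRIGGER enrich_on_insert
        BEFORE INSERT ON big_table
        FOR EACH ROW EXECUTE FUNCTION enrich_row();
    ```

    Note that at ~1 second per API call, a synchronous per-row trigger like this will make any bulk load painfully slow, which feeds into the maintainability concern above.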

    Happy to expand further, but I would generally recommend re-opening this question under the [postgres][rest][webservices] tags, since dbt is mostly irrelevant to the question so far.