python, postgresql, architecture

Data cleaning before, during or after data ingestion?


I am building a self-contained data analytics project in Python. Because the project needs to scale, it requires a fairly solid pipeline for data processing and analytics.

So far I'm planning to use Singer (https://www.singer.io/) to ingest the data from multiple sources, with a PostgreSQL target.

The pipeline currently looks a bit like this: data sources --> ingest --> store in PostgreSQL DB --> data processing layer --> analytics environment.

I have already written pandas code to clean the data in the data processing layer, but I'm not sure whether cleaning data as it is pulled from the database into the analytics environment is best practice, especially since the cleaning would then be repeated every time the data is pulled. Should I process the data in the ingestion layer instead? How would I do that with a Singer pipeline?


Solution

  • As always, it depends.

    Cleaning data before ingest

    Pros

    • It lowers network traffic / data volume
    • It requires less storage

    Cons

    • It requires extra steps at each data source
    • These per-source steps are hard to orchestrate and monitor

    Cleaning data during ingest

    Pros

    • Preliminary checks are located in a single place
    • You are able to report metrics
      • For example: records ingested, records dropped, ingested-to-dropped ratio, etc.
    • This step can be orchestrated and monitored more easily

    Cons

    • It is just a preliminary check
      • During data modelling you might need to do further cleaning
    • The maintenance of these rules is a responsibility of the data pipeline engineer
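
To make "cleaning during ingest" concrete for a Singer pipeline: a tap writes JSON messages (SCHEMA, RECORD, STATE) to stdout and a target reads them from stdin, so a small filter can sit between the two. The sketch below is only an illustration, not part of Singer itself. The function name `clean_singer_stream`, the rule `has_email`, the `users` stream, and the `dropped_ratio` metric are all assumptions made up for the example.

```python
import json


def clean_singer_stream(lines, is_valid):
    """Filter Singer RECORD messages and count what was kept or dropped.

    `lines` is an iterable of JSON strings as a tap would emit them.
    `is_valid` is a hypothetical rule: record dict -> bool.
    """
    kept = []
    metrics = {"ingested": 0, "dropped": 0}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        message = json.loads(line)
        if message.get("type") == "RECORD":
            if not is_valid(message["record"]):
                metrics["dropped"] += 1
                continue
            metrics["ingested"] += 1
        # SCHEMA and STATE messages pass through untouched.
        kept.append(json.dumps(message))
    total = metrics["ingested"] + metrics["dropped"]
    metrics["dropped_ratio"] = metrics["dropped"] / total if total else 0.0
    return kept, metrics


def has_email(record):
    # Hypothetical preliminary check: require a non-empty "email" field.
    return bool(record.get("email"))


sample = [
    '{"type": "SCHEMA", "stream": "users", "schema": {"properties": {}}, "key_properties": ["id"]}',
    '{"type": "RECORD", "stream": "users", "record": {"id": 1, "email": "a@example.com"}}',
    '{"type": "RECORD", "stream": "users", "record": {"id": 2, "email": ""}}',
    '{"type": "STATE", "value": {"users": 2}}',
]
cleaned, metrics = clean_singer_stream(sample, has_email)
print(metrics)
```

In a real pipeline the same function could be fed `sys.stdin` and its output written to `sys.stdout`, with the script placed in the pipe between the tap and the PostgreSQL target. The metrics could then go to whatever monitoring is already in place.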

    Cleaning data after ingest

    Pros

    • It can be used not just for preliminary checks
      • For example: deduplication, filtering out unwanted outliers, etc.
    • Different cleansing steps can be defined on a data model basis

    Cons

    • It requires more storage
    • Each data scientist has to implement their own cleansing steps
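
One possible way to soften the last con is to keep the post-ingest cleaning in a single shared function that every analysis imports, rather than re-implementing it per notebook. The following is a minimal pandas sketch under assumptions: the function name `clean_frame`, the column names `order_id` and `amount`, and the 1.5 x IQR outlier rule are illustrative choices, not something the question specifies.

```python
import pandas as pd


def clean_frame(df, key_cols=("order_id",), value_col="amount", k=1.5):
    """Drop duplicate rows by key, then drop IQR outliers in value_col."""
    # Deduplication: keep the first row seen for each key.
    deduped = df.drop_duplicates(subset=list(key_cols), keep="first")
    # Outlier filter: keep values within k * IQR of the quartiles.
    q1 = deduped[value_col].quantile(0.25)
    q3 = deduped[value_col].quantile(0.75)
    iqr = q3 - q1
    in_range = deduped[value_col].between(q1 - k * iqr, q3 + k * iqr)
    return deduped[in_range].reset_index(drop=True)


# Made-up sample data: one duplicated key and one extreme value.
raw = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 4, 5],
    "amount": [10.0, 10.0, 12.0, 11.0, 13.0, 500.0],
})
clean = clean_frame(raw)
print(clean)
```

The cleaned result could also be written back to PostgreSQL as its own table, so the cleaning runs once per load instead of on every pull into the analytics environment.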

    This is not an exhaustive list, but I hope it shows how to start thinking about this problem.