Data cleaning before, during or after data ingestion?

I am building a self-contained data analytics project in Python. As the project needs to be scalable, it requires a fairly solid data processing and analytics pipeline.

So far I'm planning to use Singer (https://www.singer.io/) to ingest the data from multiple sources, with a PostgreSQL target.

The pipeline currently looks a bit like this: data sources --> ingest --> store in PostgreSQL DB --> data processing layer --> analytics environment.

I have already written Pandas code to clean data in the data processing layer, but I'm not sure whether cleaning data as it is pulled from the database into the analytics environment is best practice, especially as the processing would then be repeated each time the data is pulled. Should I process the data in the ingestion layer instead? How would I do that with a Singer pipeline?

Solution

As always, it depends.

Cleaning data before ingest

Pros

It lowers network traffic / data volume
It requires less storage

Cons

It requires extra steps from each datasource
These per-source steps are hard to orchestrate and monitor

Cleaning data during ingest

Pros

Preliminary checks are located in a single place
You are able to report metrics
- ingested, dropped, ingested-dropped ratio, etc.
This step can be orchestrated and monitored more easily

Cons

It is just a preliminary check
- During data modelling you might need to do further cleaning
The maintenance of these rules is a responsibility of the data pipeline engineer
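
As a rough illustration of cleaning during ingest with Singer: taps and targets exchange newline-delimited JSON messages (SCHEMA, RECORD, STATE) over stdout/stdin, so a small filter script can sit between them. The sketch below is only one possible approach, not a feature of Singer itself; the function names, the "required fields" rule and the field names are all assumptions made for the example.

```python
import json
import sys

# Hypothetical rule: a record is kept only if these fields are present and non-empty.
REQUIRED_FIELDS = ["id", "email"]


def is_clean(record, required_fields):
    """Return True when every required field is present and not None or ''."""
    return all(record.get(field) not in (None, "") for field in required_fields)


def clean_stream(lines, required_fields, metrics):
    """Yield the Singer messages to forward, skipping RECORD messages that fail the check.

    SCHEMA and STATE messages pass through untouched.
    `metrics` is a dict with 'ingested' and 'dropped' counters, updated in place.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        message = json.loads(line)
        if message.get("type") == "RECORD":
            if is_clean(message.get("record", {}), required_fields):
                metrics["ingested"] += 1
            else:
                metrics["dropped"] += 1
                continue  # drop the record, do not forward it to the target
        yield json.dumps(message)


def main():
    """Read tap output from stdin, write cleaned messages to stdout."""
    metrics = {"ingested": 0, "dropped": 0}
    for out in clean_stream(sys.stdin, REQUIRED_FIELDS, metrics):
        print(out)
    # Report metrics on stderr so they do not pollute the message stream.
    print(json.dumps(metrics), file=sys.stderr)
```

You would call `main()` from the script's entry point and wire it up with something like `some-tap | python clean_filter.py | some-target` (placeholder names). The ingested/dropped counters are exactly the kind of metric you can then feed into monitoring.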

Cleaning data after ingest

Pros

It can be used not just for preliminary checks
- For example: deduplication, filter unwanted outliers, etc.
Different cleansing steps can be defined on a data model basis

Cons

It requires more storage
Each data scientist has to implement his/her own cleansing steps
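
To avoid repeating the Pandas cleaning on every pull, one option is to run it once after ingest and write the result to a separate table that the analytics environment reads. A minimal sketch, assuming a hypothetical `orders` table with `order_id` and `amount` columns, a 1.5 x IQR outlier rule, and an `orders_clean` output table (all of these are assumptions, adapt them to your data model):

```python
import pandas as pd


def clean_orders(raw: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate on order_id (keeping the last row) and drop amount outliers."""
    deduped = raw.drop_duplicates(subset=["order_id"], keep="last")

    # Interquartile-range rule: keep amounts within 1.5 * IQR of the quartiles.
    q1 = deduped["amount"].quantile(0.25)
    q3 = deduped["amount"].quantile(0.75)
    iqr = q3 - q1
    in_range = deduped["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

    return deduped[in_range].reset_index(drop=True)


def persist_clean(clean: pd.DataFrame, engine) -> None:
    """Write the cleaned frame to its own table.

    `engine` is assumed to be a SQLAlchemy engine connected to the PostgreSQL DB;
    `orders_clean` is a hypothetical table name.
    """
    clean.to_sql("orders_clean", engine, if_exists="replace", index=False)
```

Run as a scheduled job after each ingest, this keeps the raw table intact while analysts read the cleaned table directly, so the cleansing logic lives in one place instead of being reimplemented per notebook.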

This is not an exhaustive list, but I hope it shows how to start thinking about this problem.