I am building a self-contained data analytics project in Python. Since the project needs to scale, it requires a fairly solid data processing and analytics pipeline.
So far I'm planning to use Singer (https://www.singer.io/) to ingest the data from multiple sources, with a PostgreSQL target.
The pipeline currently looks roughly like this: data sources --> ingest --> store in PostgreSQL DB --> data processing layer --> analytics environment.
I have already written Pandas code to clean the data in the data processing layer, but I'm not sure whether cleaning the data as it is pulled from the database into the analytics environment is best practice, especially since the processing would then be repeated every time the data is pulled. Should I process the data in the ingestion layer instead? How would I do that with a Singer pipeline?
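For context, the cleaning currently happens at read time, roughly like the sketch below. The table, column names, and connection string are placeholders for illustration, not the project's real schema:

```python
# Minimal sketch of the current "clean on read" pattern.
# raw_events, created_at, amount and the connection URL are made-up examples.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/analytics")

def load_clean_events() -> pd.DataFrame:
    """Pull raw rows from PostgreSQL and clean them on the way into the analytics environment."""
    df = pd.read_sql("SELECT * FROM raw_events", engine)

    # These cleaning steps run on every pull, which is the repetition I'm worried about.
    df = df.drop_duplicates()
    df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    return df.dropna(subset=["created_at"])
```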
As always, it depends.
This is not an exhaustive list, but I hope it shows how you should start thinking about this problem.