I have a large collection of roughly 3.2 million records. The collection is updated monthly, but the source data is fetched as-is, meaning I don't get just the updated records but everything. In terms of performance, is it better to simply drop the collection and insert everything, or to do an update for each record? Also, is there a good way to compare an existing record with the one being read from the source to check whether there's any change?
Thanks.
Also is there a good way to compare existing record with the one being read from the source to check if there's any change?
You're looking for change detection, a problem commonly described for ETL systems. I suggest reading up on the ETL process (Kimball's The Data Warehouse ETL Toolkit is a good source). In general, detecting changes is a hard problem and involves keeping snapshots in order to compute differences. If you're sure your collection will always remain in MongoDB, you can look into whether the MongoDB oplog can help.
Furthermore, consider that change detection is tightly coupled to the structure and meaning of your data: e.g. if you have an insert-only collection, you can identify the new data by _id.
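One common snapshot-based approach is to hash each record and compare hashes between the existing collection and the incoming load. Here's a minimal sketch in plain Python (the dicts stand in for collections keyed by _id; the excluded "_loaded_at" field is a hypothetical example of a volatile field you would not want in the hash):

```python
import hashlib
import json

def record_hash(record):
    # Hash a stable serialization of the record, excluding volatile
    # fields (here a hypothetical load timestamp) so that only real
    # data changes affect the hash.
    stable = {k: v for k, v in record.items() if k != "_loaded_at"}
    payload = json.dumps(stable, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def diff_snapshots(existing, incoming):
    # existing / incoming: dicts mapping _id -> record.
    old_hashes = {rid: record_hash(r) for rid, r in existing.items()}
    inserted = [rid for rid in incoming if rid not in existing]
    deleted = [rid for rid in existing if rid not in incoming]
    changed = [rid for rid, r in incoming.items()
               if rid in existing and record_hash(r) != old_hashes[rid]]
    return inserted, deleted, changed
```

In practice you would store each document's hash alongside it (or in a side table) at load time, so that the next monthly run only has to hash the incoming records and compare against stored values instead of re-reading every existing document.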
The problem is too complex for an answer like "do this and that and you'll get it"; you have to analyze your data and work out which method is best. Refer to the literature for known solutions and avoid reinventing the wheel.
In terms of performance, is it better to simply remove the collection and insert everything or do an update for each record?
Once again, you have to know how your data is structured. If the collection changes more than it stays constant, you're better off reloading the entire collection and not tracking changes at all. If the changeset is considerably smaller than the whole collection, updating the existing documents gives better performance.
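That trade-off can be made explicit with a simple heuristic: measure what fraction of documents actually changed, and pick a strategy based on a threshold. The 0.5 cutoff below is an assumption for illustration; the right value depends on your indexes and write load:

```python
def choose_strategy(collection_size, changed_count, threshold=0.5):
    # Hypothetical heuristic: if more than `threshold` of the
    # documents changed, a full drop-and-reload tends to beat
    # issuing millions of individual updates.
    ratio = changed_count / collection_size
    return "full_reload" if ratio > threshold else "incremental_update"
```

If the incremental path wins, with PyMongo you would typically apply the changeset as one batched call, e.g. collection.bulk_write() with a list of UpdateOne(..., upsert=True) operations, rather than one round trip per document.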
Hope this helps.