We host data for an auditing service. Every few months, a new audit comes out with similar questions to previous audits of the same category. Since questions can change verbiage and number, we store each question in each audit separately (we do link them through a "related_questions" table).
audits
id | name | passing_score |
---|---|---|
1 | audit_1 | 100 |
2 | audit_2 | 150 |
questions
id | audit_id | text |
---|---|---|
1 | 1 | q1 |
2 | 1 | q2 |
3 | 2 | q1 |
4 | 2 | q2 |
We then have a surveys and responses table. Surveys are the overall response to an audit, while responses store the individual responses to each question.
surveys
id | audit_id | overall_score | pass |
---|---|---|---|
1 | 1 | 120 | true |
2 | 1 | 95 | false |
3 | 2 | 200 | true |
4 | 2 | 100 | false |
responses
id | survey_id | question_id | score |
---|---|---|---|
1 | 1 | 1 | 60 |
2 | 1 | 2 | 60 |
3 | 2 | 1 | 60 |
4 | 2 | 2 | 35 |
5 | 3 | 3 | 100 |
6 | 3 | 4 | 100 |
7 | 4 | 3 | 50 |
8 | 4 | 4 | 50 |
The analyze threshold is base threshold + scale factor * number of tuples
. The problem with this is that once an audit has finished (after a few months), we'll never receive new surveys or responses for that category of data. The new data that comes in is conceptually all that needs to be analyzed. All data is queried, but the new data has the most traffic.
If 10% is the ideal scale factor for today and analyze autoruns once every week, a couple years from now analyze may autorun once every 4 months due to the number of tuples. This is problematic when the past 3 months of data is for questions that the analyzer has never seen and so there are no helpful stats for the query planner on this data.
We could set the scale factor extremely low for this table, but that seems like a cheap solution that could cause issues in the future.
If you have a constant data modifications rate, setting autovacuum_analyze_scale_factor
to 0 for that table and only using autovacuum_analyze_threshold
is a good idea.