I'm running Influx v1.3. I have a measurement with ~2+ million records within 1 month timespan and 10 tags within. Out of those, I'm interested in user_id
and article_id
.
I'm trying to extract number of events logged for specific user_id
with non-empty article_id
.
The query for selecting all user's events is blazingly fast:
select count(*) from pageviews where user_id = '92363';
The query for specific event for user and article is blazingly fast:
select * from pageviews where user_id = '92363' and article_id = '879729';
But when I try to add filter for non-empty (or empty) article_id
, the query runs for tens of seconds.
select count(*) from pageviews where user_id = '92363' and article_id != '';
select count(*) from pageviews where user_id = '92363' and article_id !~ /.*/;
Is there something I'm doing wrong or something I should configure within the database? This doesn't seem to be right. The number of user's events is <100, I'd literally went through them faster manually.
I'm currently running the default configuration.
Thanks.
With queries using !=
or =~
the DB has to compare against each entry for that tag. If you have 2M entries, it will be slow. When using regular expressions (=~
), even slower.
In short, no, you are not doing anything wrong. These types of queries are O(n)
in influxdb where n is the number of rows you are comparing against.
I would suggest re-thinking your schema to avoid such queries. You could, for example, set an integer tag has_article
that is either 0
or 1
indicating presence of an article. The query can then use has_article = 1
.
Then again, whether the data set and data model is a good fit for influxdb is another question - the data set does not seem to be time series data.
InfluxDB is a specialised time series DB, not a general purpose data store.