Selecting InfluxDB records with non-empty tags gets terribly slow

I'm running Influx v1.3. I have a measurement with ~2+ million records within 1 month timespan and 10 tags within. Out of those, I'm interested in user_id and article_id.

I'm trying to extract number of events logged for specific user_id with non-empty article_id.

The query for selecting all user's events is blazingly fast:

select count(*) from pageviews where user_id = '92363';

The query for specific event for user and article is blazingly fast:

select * from pageviews where user_id = '92363' and article_id = '879729';

But when I try to add filter for non-empty (or empty) article_id, the query runs for tens of seconds.

select count(*) from pageviews where user_id = '92363' and article_id != '';
select count(*) from pageviews where user_id = '92363' and article_id !~ /.*/;

Is there something I'm doing wrong or something I should configure within the database? This doesn't seem to be right. The number of user's events is <100, I'd literally went through them faster manually.

I'm currently running the default configuration.

Thanks.

Solution

With queries using != or =~ the DB has to compare against each entry for that tag. If you have 2M entries, it will be slow. When using regular expressions (=~), even slower.

In short, no, you are not doing anything wrong. These types of queries are O(n) in influxdb where n is the number of rows you are comparing against.

I would suggest re-thinking your schema to avoid such queries. You could, for example, set an integer tag has_article that is either 0 or 1 indicating presence of an article. The query can then use has_article = 1.

Then again, whether the data set and data model is a good fit for influxdb is another question - the data set does not seem to be time series data.

InfluxDB is a specialised time series DB, not a general purpose data store.