Search code examples
clickhouseeventual-consistency

Clickhouse external dictionary returns inconsistent results


We have a dictionary whose source is an external postgres, the layout is complex_key_cache and the lifetime is 3600-7200.

When I do

select * from dictionaries.dictname limit 5

I only get some results (less than 5, I think the dict has about 10k rows). But if I do

SELECT dictGet('dictname', 'catId', tuple('somestring'))

then assuming the key exists I always get a result. And on some of the next calls to select * I will get that row (key somestring) returned.

And the same happens with this query

SELECT d,
       catId,
       dictGet('dictname', 'catId', tuple(base64Decode(d))) AS categoryId
  FROM dictionaries.dictname

I only get some results - probably the ones that are in cache.

So 2 questions:

How can I ensure that I get all results given a list? eg.

SELECT d,
       catId
  FROM dictionaries.dictname d
 WHERE d in ('somestring', 'anotherstring')

And why does this happen? Why does a single dictGet seem to work (doesn't use cache?) but when I try to use IN operator or the dictGet joined with the dict table it doesnt work (seems to use some kind of cache)?


Solution

  • From what I understand, since this dict uses complex_key_cache clickhouse only uses the cache when not filtering (eg listing all keys in dict) and uses the cache and maybe the source when filtering for results.

    This query was using the cached dict in the FROM clause and so the filtering was applied to the cached dict:

    SELECT d,
           catId
      FROM dictionaries.dictname d
     WHERE d in ('somestring', 'anotherstring') 
    

    but if we do it this way

    SELECT d,
           dictGet('dictname', 'category_id', tuple(d)) AS categoryId
      FROM (
            SELECT ['somestring', 'anotherstring'] AS d
           ) ARRAY
      JOIN d
    

    it forces the dictGet to go the source if it doesn't find the key in the cache.