We have a dictionary whose source is an external PostgreSQL table; the layout is complex_key_cache and the lifetime is 3600-7200.
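For context, the definition looks roughly like this (a minimal sketch: the key and attribute columns match the queries below, but the connection details and cache size are placeholder assumptions):

CREATE DICTIONARY dictionaries.dictname
(
    d String,
    catId UInt64
)
PRIMARY KEY d
SOURCE(POSTGRESQL(
    host 'pg-host'
    port 5432
    user 'pg-user'
    password 'pg-password'
    db 'pg-db'
    table 'pg-table'
))
LAYOUT(COMPLEX_KEY_CACHE(SIZE_IN_CELLS 100000))
LIFETIME(MIN 3600 MAX 7200)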
When I do
select * from dictionaries.dictname limit 5
I only get some results (fewer than 5; I think the dict has about 10k rows). But if I do
SELECT dictGet('dictname', 'catId', tuple('somestring'))
then, assuming the key exists, I always get a result. And on some of the next calls to select * I will get that row (key 'somestring') returned.
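In other words, this sequence reproduces the behaviour (assuming 'somestring' exists in the source table):

-- cache miss: dictGet fetches the key from the PostgreSQL source and caches it
SELECT dictGet('dictname', 'catId', tuple('somestring'))

-- the freshly cached row is now visible when reading the dictionary table
SELECT * FROM dictionaries.dictname WHERE d = 'somestring'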
And the same happens with this query
SELECT d,
catId,
dictGet('dictname', 'catId', tuple(base64Decode(d))) AS categoryId
FROM dictionaries.dictname
I only get some results - probably only the ones that are already in the cache.
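One way to check how much of the dictionary is actually cached (my suggestion, using the standard system.dictionaries table) is:

SELECT name, status, element_count, hit_rate
FROM system.dictionaries
WHERE name = 'dictname'

If element_count is well below the ~10k rows in the source, only that cached subset is visible when selecting from dictionaries.dictname.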
So 2 questions:
How can I ensure that I get all results given a list? E.g.
SELECT d,
catId
FROM dictionaries.dictname d
WHERE d IN ('somestring', 'anotherstring')
And why does this happen? Why does a single dictGet always seem to work (does it skip the cache?), while the IN operator, or dictGet joined against the dict table, doesn't work (it seems to hit some kind of cache)?
From what I understand, since this dict uses complex_key_cache, ClickHouse reads only the cache when scanning the dictionary table (e.g. listing all keys in the dict), but consults the cache and, on a miss, the source when a key is looked up with dictGet.
This query was reading the cached dict in the FROM clause, and so the filtering was applied to the cached rows only:
SELECT d,
catId
FROM dictionaries.dictname d
WHERE d IN ('somestring', 'anotherstring')
but if we do it this way
SELECT d,
       dictGet('dictname', 'catId', tuple(d)) AS categoryId
FROM
(
    SELECT ['somestring', 'anotherstring'] AS d
)
ARRAY JOIN d
it forces the dictGet to go to the source if it doesn't find the key in the cache.
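An equivalent sketch using the arrayJoin function instead of the ARRAY JOIN clause:

SELECT d,
       dictGet('dictname', 'catId', tuple(d)) AS categoryId
FROM
(
    SELECT arrayJoin(['somestring', 'anotherstring']) AS d
)

Either way, every key in the list is resolved through dictGet, which falls back to the PostgreSQL source on a cache miss. Keys that don't exist in the source return the attribute's default value; wrap the lookup with dictHas('dictname', tuple(d)) if you need to tell those apart.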