Here's what one of my documents might look like:
{
"CC":{"colors":["Blue","Green","Yellow"]},
"CN":{"colors":["White","Green","Blue"]},
"WA":{"colors":["Orange","Green","Blue"]},
...
}
I want a terms aggregation on the intersection of two fields, CC.colors and CN.colors. For this document, the intersection is ["Green", "Blue"], and I want the terms aggregation to run over those values.
As far as I understand, there are two ways to do it.
1) A painless script in a terms aggregation, which returns the intersection of these two arrays for each document.
2) A new field created at index time, maybe called CC_CN.colors, which holds the intersection for every doc.
I can't go ahead with 2) because there would be too many combinations. At search time I may need any of them: CC_CN, CC_WA, WA_CN_CC, etc.
Option 1) works, but it gets painfully slow. One reason is that a scripted terms aggregation cannot use global ordinals.
Is there any trick to make Elastic build a custom global ordinal for my painless terms aggregation? I know there are just 25 colors in my system, so I could give Elastic the full list of colors somewhere and "assure" it that my aggregation will never return anything but these colors.
Or, if I encode and store numbers instead of strings in the index, would this be faster for Elastic? E.g. 0 instead of "Black", 1 instead of "Green", etc.
Besides intersection, my use cases also involve union and similar set operations. Thanks for reading!
To answer it myself, we ended up asking for these arrays in _source and performing union/intersection in Ruby.
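For illustration, the client-side approach could look roughly like the sketch below. It is not our production code: the helper name count_terms, the op: parameter and the sample hits are hypothetical, and it assumes each hit's _source has the same shape as the sample document, with missing fields treated as empty arrays.

```ruby
# Hypothetical helper: for every hit, combine the requested color arrays
# (intersection by default, union on request) and count each resulting term
# once per document, which mimics what a terms aggregation would report.
def count_terms(sources, keys, op: :intersection)
  counts = Hash.new(0)
  sources.each do |src|
    # Assumption: a missing field behaves like an empty array.
    arrays = keys.map { |k| src.dig(k, 'colors') || [] }
    combined =
      if op == :union
        arrays.reduce([], :|)      # Array#| is a de-duplicating union
      else
        arrays.reduce(:&) || []    # Array#& is intersection
      end
    combined.each { |term| counts[term] += 1 }
  end
  counts
end

# Made-up sample hits, shaped like the document in the question.
sources = [
  { 'CC' => { 'colors' => %w[Blue Green Yellow] },
    'CN' => { 'colors' => %w[White Green Blue] },
    'WA' => { 'colors' => %w[Orange Green Blue] } },
  { 'CC' => { 'colors' => %w[Blue Red] },
    'CN' => { 'colors' => %w[Blue] } }
]

intersection_counts = count_terms(sources, %w[CC CN])
union_counts = count_terms(sources, %w[CC CN], op: :union)
```

The obvious cost is that every matching document's arrays have to travel over the wire, so this only stays practical while the result set is reasonably small.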
It is also possible to do this in painless, and that offers somewhat better performance. Elastic uses map to do the aggregation in that case, and I couldn't figure out any way to use global ordinals. I don't think it's possible.
We wrote code that generates painless code to perform intersection and union between arrays. For any future wanderer, here's what the generated code looks like:
This is for union:
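// Start from an empty stream and append every requested field to it.
Stream stream = [].stream();
String[] stream_keys = new String[] {'CC.colors', 'CN.colors'};
for (int i = 0; i < stream_keys.length; ++i) {
  // Skip fields that are not mapped in this index.
  if (doc.containsKey(stream_keys[i])) {
    stream = Stream.concat(stream, doc[stream_keys[i]].stream());
  }
}
// Drop duplicates so each value appears once per document.
stream = stream.distinct();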
And this is for intersection (the intersection of stream, list_0_stream and list_1_stream):
List list_0 = list_0_stream.collect(Collectors.toList());
List list_1 = list_1_stream.collect(Collectors.toList());
return stream.filter(list_0::contains).filter(list_1::contains).toArray();
The performance seems to be acceptable.
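The generator itself is not shown above, so here is only a minimal sketch of how such a generator might look for the union case. The method names painless_union and terms_agg_body, the field-name check and the aggregation name 'colors' are all hypothetical, and the exact request shape for a scripted terms aggregation may differ between Elasticsearch versions.

```ruby
# Hypothetical generator: emits a Painless snippet that unions the given
# doc-value fields, mirroring the generated union code shown above.
def painless_union(fields)
  fields.each do |f|
    # Assumption: field names are plain dotted identifiers. Rejecting anything
    # else keeps quotes out of the generated script.
    raise ArgumentError, "bad field name: #{f}" unless f.match?(/\A[\w.]+\z/)
  end
  keys = fields.map { |f| "'#{f}'" }.join(', ')
  <<~PAINLESS
    Stream stream = [].stream();
    String[] stream_keys = new String[] {#{keys}};
    for (int i = 0; i < stream_keys.length; ++i) {
      if (doc.containsKey(stream_keys[i])) {
        stream = Stream.concat(stream, doc[stream_keys[i]].stream());
      }
    }
    return stream.distinct().toArray();
  PAINLESS
end

# Wraps a script in a terms aggregation request body (as a Ruby Hash).
def terms_agg_body(script_source, name: 'colors')
  {
    size: 0,
    aggs: {
      name => {
        terms: { script: { lang: 'painless', source: script_source } }
      }
    }
  }
end

script = painless_union(%w[CC.colors CN.colors])
body = terms_agg_body(script)
```

Generating the script from a list of field names is what lets one code path serve CC_CN, CC_WA, WA_CN_CC and any other combination without creating a new indexed field for each.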