Ref: https://www.elastic.co/guide/en/elasticsearch/reference/current/removal-of-types.html
We are introducing logging with Elasticsearch for a multi-tenant application (e.g. around 10,000 tenants). We need to log profile_edits, user_comments, cron_activities, category_edits and about 30 more categories.
I found two ways to store these logs.
POST tenant-1/_doc
{
"type" : "profile_edits",
"fullname" : "NewName",
"age" : 11,
"score" : 999
...
}
POST tenant-1/_doc
{
"type" : "user_comments",
"user" : "User1",
"comment" : "Nice!"
}
With this approach, the number of indices equals the number of tenants.
POST profile_edits/_doc
{
"tenant" : 1,
"fullname" : "NewName",
"age" : 11,
"score" : 999
}
POST user_comments/_doc
{
"tenant" : 1,
"user" : "User1",
"comment" : "Nice!"
}
With this approach, I need only about 35 indices in total.
Which method works better?
I agree with another user, @Evaldas, that this is opinion-based. In my opinion, and from considerable experience with large-scale ES deployments, it is better to have one index per category, such as profile_edits and user_edits, with a common field, for example tenant_id, that you can use to filter the data for a particular tenant. A few pros of this approach:
You will have far less index management overhead, since you need to manage only about 35 indices instead of 10k.
You can still get good performance, because you can filter on tenant_id, and frequently used filters are cached automatically by ES; refer to filter context for more info.
The cluster state (information about all indices, shards and their state) will be much smaller. Newer ES versions have optimized how the cluster state is published, but if you are on a really old version, a smaller cluster state helps and gives better performance.
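To illustrate the tenant_id filter mentioned above, here is a minimal sketch in Python that only builds the search request body, so it does not depend on any client library. The helper name tenant_logs_query and its parameters are hypothetical; the comment field is taken from the user_comments example in the question, and the field name tenant_id is an assumption (the question's documents call it tenant).

```python
import json


def tenant_logs_query(tenant_id, text=None, tenant_field="tenant_id"):
    """Build a search body restricted to one tenant (hypothetical helper)."""
    bool_query = {
        # Clauses under "filter" run in filter context: they are not scored
        # and are candidates for caching.
        "filter": [{"term": {tenant_field: tenant_id}}]
    }
    if text is not None:
        # Optional scored full-text clause on the "comment" field from the
        # question's user_comments example.
        bool_query["must"] = [{"match": {"comment": text}}]
    return {"query": {"bool": bool_query}}


body = tenant_logs_query(1, text="Nice")
print(json.dumps(body, indent=2))
```

The resulting body could be sent to POST user_comments/_search. Keeping the tenant restriction in the filter clause rather than in must is what places it in filter context.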
Last but not least, your use case is similar to what is discussed in this official ES blog, which also recommends avoiding too many indices and grouping them instead. Below is a tip from the same blog:
TIP: In order to reduce the number of indices and avoid large and sprawling mappings, consider storing data with similar structure in the same index rather than splitting into separate indices based on where the data comes from. It is important to find a good balance between the number of indices and shards, and the mapping size for each individual index. Because the cluster state is loaded into the heap on every node (including the masters), and the amount of heap is directly proportional to the number of indices, fields per index and shards, it is important to also monitor the heap usage on master nodes and make sure they are sized appropriately.
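As a rough back-of-the-envelope comparison of the two layouts from the question, the sketch below counts indices and shards. It assumes 1 primary shard and 1 replica per index and ignores any time-based rollover; these settings and the helper name layout_size are assumptions for illustration, not figures from the question or the blog.

```python
def layout_size(index_count, primaries_per_index=1, replicas_per_primary=1):
    """Count indices and total shards for a layout (hypothetical helper)."""
    # Each primary shard exists once, plus one copy per configured replica.
    shards = index_count * primaries_per_index * (1 + replicas_per_primary)
    return {"indices": index_count, "shards": shards}


per_tenant = layout_size(10_000)   # one index per tenant
per_category = layout_size(35)     # one index per log category

print(per_tenant)    # {'indices': 10000, 'shards': 20000}
print(per_category)  # {'indices': 35, 'shards': 70}
```

Under these assumptions the per-tenant layout puts 20,000 shards into the cluster state versus 70 for the per-category layout, which is the overhead the tip warns about.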