Search code examples
databasetime-seriesvictoriametrics

How MergeSet works in VictoriaMetrics?


I met a research paper referred MergeSet from VictoriaMetrics.

It says

MergeSet is a simplified one-level LSM-tree implemented by VictoriaMetrics [6], which concatenates a tag and a posting ID together in a key, and the posting list is naturally formed because of the sorted order maintained by the LSM-tree.

Although I tried to read the source code from VM yet not sure about what MergeSet is.

For a common LSM-tree, each element is a key-value pair sorted by the key. If MergeSet follows this paradigm, then my questions are as follows:

  1. Does MergeSet concatenate each tag key-value pair with related TSID and store it as the key of a tuple in the MergeSet.table?
  2. If the tag key-value pair and TSID are both embedded in keys, what does the MergeSet.table value represent?

If not, should I consider MergeSet as a (partially) sorted string set inspired by the LSM-tree, utilizing on-disk capacity?


Solution

  • The github.com/VictoriaMetrics/VictoriaMetrics/lib/mergeset package implements LSM-like data structure with the following properties:

    • It stores opaque byte slices in sorted order via Table.AddItems method.
    • It processes stored byte slices in blocks of up to 64 kb in size. Every block is compressed before storing to disk. This reduces disk IO and disk space usage.
    • It merges similarly-sized parts into bigger parts in background in order to keep the number of parts under control. This improves compression ratio and query speed.
    • It provides O(log(N)) search and O(1) prefix scan over sorted byte slices via TableSearch struct.
    • It allows making instant snapshots as described in this article.

    The mergeset is used by VictoriaMetrics for storing various indexes, which are collectively named indexdb. These indexes include the following entries:

    • metricName -> metricID, which allows locating internal id of time series (aka metricID) by canonical name of the time series (aka meteicName) when storing the ingested raw samples into VictoriaMetrics. The canonical name of time series includes metric name plus all the labels of the time series sorted in a particular order.

    • metricID -> metricName, which allows locating metrics names plus all the labels for time series with the given internal id.

    • label=value -> metricID, which allows locating time series with the given label=value label. These entries are known as inverted index, and they are used for fast search of time series by the given label filters.

    How does VictoriaMetrics store these entries in the mergeset, which works only with sorted byte slices? It marshals entries into byte slices in the way they can be searched via prefix scan. It also prepends byte slices for every entry type with an individual prefix, so they do not clash with each other. See currently supported prefixes.