While reading the paper Dremel: Interactive Analysis of Web-Scale Datasets, I came across the concept of repetition and definition levels.
I understand why they are needed: to disambiguate occurrences, the format attaches a repetition level and a definition level to each value.
What is unclear to me is how the levels are computed.
The paper says:
Consider field Code in Figure 2. It occurs three times in r1. Occurrences ‘en-us’ and ‘en’ are inside the first Name, while ‘en-gb’ is in the third Name. To disambiguate these occurrences, we attach a repetition level to each value. It tells us at what repeated field in the field’s path the value has repeated.
The field path Name.Language.Code contains two repeated fields, Name and Language. Hence, the repetition level of Code ranges between 0 and 2; level 0 denotes the start of a new record. Now suppose we are scanning record r1 top down. When we encounter ‘en-us’, we have not seen any repeated fields, i.e., the repetition level is 0. When we see ‘en’, field Language has repeated, so the repetition level is 2.
I just can't get my head around it. Name.Language.Code in r1 has the values en-us and en. Why is the first one r = 0 and the second one r = 2? Is it because two definitions were repeated (Language and Code)?
If the record were instead:
Name
  Language
    Code: en-us
Name
  Language
    Code: en
Name
  Language
    Code: en-gb
would the repetition and definition levels be the following?
0 2
1 2
2 2
Definition levels. Each value of a field with path p, esp. every NULL, has a definition level specifying how many fields in p that could be undefined (because they are optional or repeated) are actually present in the record.
Why, then, is the definition level 2? Doesn't the path Name.Language contain two fields, Code and Country, of which only one is optional or repeated?
The Dremel striping algorithm is by no means trivial.
To answer your first question:
The repetition level of en-us is 0 because it is the first occurrence of the Name.Language.Code path within the record.
The repetition level of en is 2 because, relative to the previous value, the field that repeated is Language, which is the second repeated field in the path (Name is level 1, Language is level 2).
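The rule above can be sketched in a few lines of Python. This is only an illustration, not the paper's striping algorithm: the nested-list representation of r1 and the function name `repetition_levels` are assumptions made for the example, and NULL entries (such as the one for the second Name, which has no Language) are left out to keep the focus on repetition levels.

```python
# Hypothetical in-memory form of r1: a list of Name groups, each holding the
# Code values of its Language groups. The second Name has no Language.
r1_names = [
    ["en-us", "en"],  # first Name: two Language entries
    [],               # second Name: no Language at all
    ["en-gb"],        # third Name: one Language entry
]


def repetition_levels(names):
    """Return (code, r) pairs for the Code values that are present."""
    out = []
    for name_index, languages in enumerate(names):
        for lang_index, code in enumerate(languages):
            if lang_index > 0:
                r = 2  # same Name, another Language: Language repeated
            elif name_index > 0:
                r = 1  # first Language of a later Name: Name repeated
            else:
                r = 0  # first Language of the first Name: start of record
            out.append((code, r))
    return out


print(repetition_levels(r1_names))
# [('en-us', 0), ('en', 2), ('en-gb', 1)]
```

The level is decided by the outermost field that changed since the previous value: a new Language inside the same Name gives 2, a new Name gives 1, and the start of the record gives 0.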
To answer your second question, for the following record,
DocId: 20
Name
  Language
    Code: en-us
Name
  Language
    Code: en
Name
  Language
    Code: en-gb
the entries for Name.Language.Code (value, repetition level, definition level) would be
en-us 0 2
en 1 2
en-gb 1 2
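Explanation:
The definition level is 2 for all three values because both Name and Language, the two fields in the path that could be undefined, are present; Code is required, so it does not add to the count. The repetition level of en-us is 0, since it is the first Name.Language.Code within the record. The repetition levels of en and en-gb are 1, since in both cases the repetition occurred at the Name field (level 1).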
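To see the definition level next to the repetition level, here is a minimal sketch that produces full (value, r, d) entries for the Name.Language.Code column, including NULLs. It assumes the schema from the paper's Figure 2 (Name and Language repeated, Code required), and the nested-list record layout and the name `stripe_code` are made up for the illustration.

```python
def stripe_code(names):
    """Return (value, r, d) entries for the Name.Language.Code column.

    `names` is a list of Name groups; each Name is a list of Code strings,
    one per Language. Code is assumed required, so a Language always has one.
    """
    if not names:
        return [(None, 0, 0)]  # no Name at all: nothing on the path is defined
    out = []
    for name_index, languages in enumerate(names):
        name_r = 0 if name_index == 0 else 1  # a later Name means Name repeated
        if not languages:
            out.append((None, name_r, 1))  # NULL: only Name is defined
            continue
        for lang_index, code in enumerate(languages):
            r = name_r if lang_index == 0 else 2  # later Language: level 2
            out.append((code, r, 2))  # Name and Language are both defined
    return out


# The record from this answer: three Names with one Language each.
print(stripe_code([["en-us"], ["en"], ["en-gb"]]))
# [('en-us', 0, 2), ('en', 1, 2), ('en-gb', 1, 2)]

# r1 from the paper: the second Name has no Language, so it yields a NULL.
print(stripe_code([["en-us", "en"], [], ["en-gb"]]))
# [('en-us', 0, 2), ('en', 2, 2), (None, 1, 1), ('en-gb', 1, 2)]
```

The definition level only counts the optional or repeated fields on the path that are present, which is why it never exceeds 2 here even though the path has three fields.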