Group VarInt
In Chapter 6, we discussed the byte-aligned vByte method as an example of an
index compression technique with high decoding performance. After we had
written the chapter, Jeff Dean gave a keynote talk at WSDM 2009 in which he
discussed a few compression algorithms used by Google, including the
byte-aligned Group VarInt method.
Group VarInt is named after VarInt (variable-size integer coding), Google's
version of the vByte algorithm. vByte's decoding routine is very efficient,
but often suffers from branch misprediction. Consider the vByte representation
of the postings list L = 〈80, 400, 431, 686〉:

    Δ(L)     = 〈80, 320, 31, 255〉
    vByte(L) = 01010000  11000000 00000010  00011111  11111111 00000001

(the most significant bit of each byte is its continuation bit; a 1 indicates
that the next byte still belongs to the same posting).
When working through the encoded postings sequence, the decoder needs to
inspect every single continuation bit and, depending on whether it is 1 or 0,
either continue with the current posting or skip to the next. This means
there is a branch instruction, and thus the possibility of a branch
misprediction, for every byte in the encoded postings sequence.
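To make this branch explicit, here is a minimal vByte decoder sketch (our own
illustration, not the book's implementation); the loop condition tests the
continuation bit of every input byte and is therefore a data-dependent branch:

```cpp
#include <cstdint>

// Decodes one vByte-encoded integer starting at 'in'; returns a pointer
// to the first byte after the encoded value. The while-condition below
// is the per-byte branch that the CPU may mispredict.
static const uint8_t* DecodeVByte(const uint8_t* in, uint32_t* value) {
  uint32_t result = 0;
  int shift = 0;
  while (*in & 0x80) {  // continuation bit set: more bytes follow
    result |= (uint32_t)(*in++ & 0x7F) << shift;
    shift += 7;
  }
  *value = result | ((uint32_t)(*in++) << shift);
  return in;
}
```

Decoding the six-byte example sequence above evaluates this condition once per
byte, so the CPU has to predict six branch outcomes for only four postings.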
Group VarInt addresses this problem in two steps:
- Replacing the continuation bits with a 2-bit selector value that appears
  before the body of the encoded posting. This allows us to use simple
  arithmetic (or a lookup table) to determine how many bytes to consume for
  the current posting.
- Combining four consecutive postings into a group for better space
  utilization and alignment of the selector values (or to reduce the number
  of table lookups in case a lookup table is used).
For the same postings list as before, the Group VarInt representation is:
    GroupVarInt(L) = 00000100  01010000  01000000 00000001  00011111  11111111

(the first byte is the selector byte; the selector value for the i-th posting
of a group is stored in bits 2i−2 and 2i−1 and gives the posting's length in
bytes, minus one). For example, the second Δ-value, 320, can be encoded using
2 bytes, so its selector value is 01.
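For illustration, a matching encoder might look as follows (a sketch; the
names EncodeGroupVarInt and ByteLength and their interfaces are our own
assumptions, but the byte layout follows the decoder shown further down):

```cpp
#include <cstdint>

// Number of bytes (1..4) needed to store v in little-endian order.
static int ByteLength(uint32_t v) {
  if (v < (1u << 8)) return 1;
  if (v < (1u << 16)) return 2;
  if (v < (1u << 24)) return 3;
  return 4;
}

// Encodes exactly four Delta-values into 'out' (at most 17 bytes);
// returns the number of bytes written. The selector for the i-th value
// (i = 0..3) occupies bits 2i and 2i+1 of the leading selector byte.
static int EncodeGroupVarInt(const uint32_t* values, uint8_t* out) {
  uint8_t* p = out + 1;  // reserve the first byte for the selector
  uint8_t selector = 0;
  for (int i = 0; i < 4; i++) {
    int len = ByteLength(values[i]);
    selector |= (uint8_t)((len - 1) << (2 * i));
    for (int j = 0; j < len; j++)  // emit little-endian payload bytes
      *p++ = (uint8_t)(values[i] >> (8 * j));
  }
  out[0] = selector;
  return (int)(p - out);
}
```

Applied to the Δ-values 〈80, 320, 31, 255〉, this encoder emits a selector
byte followed by 1 + 2 + 1 + 1 payload bytes, six bytes in total.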
Publicly available implementations of Group VarInt typically use
a precomputed lookup table to determine the position of each posting
in the encoded data chunk and the bit mask to apply when extracting
it. In our own experiments, we found that a little bit of shifting works
just as well.
The following decoder implementation assumes a little-endian
architecture that supports unaligned memory access (e.g.,
Intel/AMD); on other architectures, the decoding operations may
be slightly more complicated.
#include <stdint.h>

typedef uint8_t byte;

void DecodeGroupVarInt(const byte* compressed, int size, uint32_t* uncompressed) {
  // MASK[s] keeps the lowest s+1 bytes of a 32-bit word.
  static const uint32_t MASK[4] = { 0xFF, 0xFFFF, 0xFFFFFF, 0xFFFFFFFF };
  const byte* limit = compressed + size;
  uint32_t current_value = 0;
  while (compressed != limit) {
    // Each group starts with a selector byte holding four 2-bit lengths.
    const uint32_t selector = *compressed++;
    // Note: each load below reads 4 bytes regardless of the selector, so the
    // buffer should have 3 bytes of padding after the final group.
    const uint32_t selector1 = (selector & 3);
    current_value += *((const uint32_t*)(compressed)) & MASK[selector1];
    *uncompressed++ = current_value;
    compressed += selector1 + 1;
    const uint32_t selector2 = ((selector >> 2) & 3);
    current_value += *((const uint32_t*)(compressed)) & MASK[selector2];
    *uncompressed++ = current_value;
    compressed += selector2 + 1;
    const uint32_t selector3 = ((selector >> 4) & 3);
    current_value += *((const uint32_t*)(compressed)) & MASK[selector3];
    *uncompressed++ = current_value;
    compressed += selector3 + 1;
    const uint32_t selector4 = (selector >> 6);
    current_value += *((const uint32_t*)(compressed)) & MASK[selector4];
    *uncompressed++ = current_value;
    compressed += selector4 + 1;
  }
}
If the number of postings in the compressed list is not a multiple
of 4, then special-case handling may be necessary for the final
1–3 postings in the list — for instance, by encoding
those postings using vByte.
Decoding Performance
The following table is identical to Table 6.9 (page 213)
in the book, except for the addition of Group VarInt:
Compression Method    Compression        Decoding          Cumulative Overhead
                      (bits per docid)   (ns per docid)    (decoding + disk I/O)
Gamma                 4.56               7.73              13.98 ns
Golomb                3.78               10.85             16.03 ns
Rice                  3.81               6.46              11.68 ns
LLRUN                 3.89               7.04              12.37 ns
Interpolative         3.87               26.50             31.80 ns
vByte                 8.11               1.39              12.50 ns
Simple-9              4.91               2.76              9.49 ns
Group VarInt          10.45              1.09              14.84 ns
Perhaps surprisingly, the performance difference between vByte and
Group VarInt in the above table is very small. The reason for this is
that the vast majority of Δ-values in the docid index for GOV2
require 7 bits or less, in which case vByte incurs no branch
mispredictions. To evaluate the performance difference between vByte
and Group VarInt on an index in which the Δ-values vary over a
larger range, we repeated the experiment from Table 6.9 for a
schema-independent index instead of a docid index:
Compression Method    Compression          Decoding           Cumulative Overhead
                      (bits per position)  (ns per position)  (decoding + disk I/O)
Gamma                 14.09                12.81              32.11 ns
Golomb                12.14                12.74              29.37 ns
Rice                  12.41                7.32               24.32 ns
LLRUN                 10.07                8.80               22.60 ns
Interpolative         10.34                31.63              45.80 ns
vByte                 12.03                4.34               20.82 ns
Simple-9              13.14                4.21               22.21 ns
Group VarInt          13.10                1.90               19.85 ns
Compared to the docid index, the decoding overhead for vByte increases
by 3.0 ns per list element (+212%). The decoding overhead for Group VarInt,
on the other hand, grows by only 0.8 ns per list element (+74%).
Thus, while vByte may be an appropriate compression method for docid
indices, since it is almost as fast as Group VarInt and slightly more
compact, it is not ideal for schema-independent postings lists, where
Group VarInt delivers much better decoding performance. If postings
lists are stored on disk, one may still argue that vByte is the superior
compression method, as it achieves better compression rates. However,
if the index is kept in memory, then the factor-2.3 difference in
decoding performance makes Group VarInt the clear winner.