Recently, I discovered that the Go standard library includes a built-in implementation of varint
, found in encoding/binary/varint.go. This implementation is similar to the varint
used in protobuf
. Using the Golang standard library's varint
source code, we will systematically learn and review the concept of varint
.
If you're familiar with protobuf
, you probably already know that all integer types (except fixed types like fixed32
and fixed64
) are encoded using varint
.
varint
mainly solves two issues:
Space Efficiency: Take
uint64
as an example, representing values as large as 18,446,744,073,709,551,615. In most real-world scenarios, however, our integer values are much smaller. If your system needs to process values as low as 1, you'd still use8 bytes
to represent this value in transmission, wasting space since most bytes store no useful data. varint encoding uses a variable-length byte sequence to represent integers, reducing the space required for smaller values.Compatibility:
varint
allows us to handle integers of different sizes without altering the encoding/decoding logic. This means fields can be upgraded from smaller types (likeuint32
) to larger ones (likeuint64
) without breaking backward compatibility.
This article will dive into Golang varint
implementation, exploring its design principles and how it addresses the challenges of encoding negative numbers.
This article was first published under the Medium MPP plan. Follow me on Medium if you're a Medium user.
The Design Principles of varint
varint
is designed based on simple principles:
- 7-bit Grouping: The binary representation of an integer is divided into 7-bit groups. From the least significant bit to the most significant bit, every 7-bit group becomes a unit.
-
Continuation Bit: A flag bit is added before each 7-bit group, forming an 8-bit byte. If more bytes follow, the flag bit is set to 1; otherwise, it’s set to 0.
For example, the integer
uint64(300)
has a binary representation of100101100
. Dividing this into two groups—10
and0101100
—and adding flag bits results in two bytes:00000010
and10101100
, which is thevarint
encoding of 300. Compared touint64
, which uses 4 bytes,varint
reduces the storage by 75%. list1: uint64 tovarint
func main() {
v := uint64(300)
bytes := make([]byte, 0)
bytes = binary.AppendUvarint(bytes, v)
fmt.Println(len(bytes))
for i := len(bytes); i > 0; i-- {
fmt.Printf("%08b ", bytes[i-1])
}
}
varint
for Unsigned Integers
The Go standard library provides two sets of varint
functions: one for unsigned integers (PutUvarint, Uvarint) and another for signed integers (varint, Putvarint).
Let's first look at the unsigned integer varint
implementation:
list2: go src PutUvarint
func PutUvarint(buf []byte, x uint64) int {
i := 0
for x >= 0x80 {
buf[i] = byte(x) | 0x80
x >>= 7
i++
}
buf[i] = byte(x)
return i + 1
}
There is a very important constant in the code: 0x80
, which corresponds to the binary code 1000 0000
. This constant is very important for the logic that follows:
-
x >= 0x80
: This checks ifx
requires more than 7 bits for representation. If it does,x
needs to be split. -
byte(x) | 0x80
: This applies a bitwise OR with0x80
(1000 0000
), ensuring the highest bit is set to 1 and extracting the lowest 7 bits ofx
. -
x >>= 7
: Shiftx
right by 7 bits to process the next group. -
buf[i] = byte(x)
: When the loop ends, the highest bits are all zeros, so no further action is needed.
Uvarint
is the reverse of PutUvarint
.
It should be noted that: varint
splits integers into 7-bit groups, meaning large integers may face inefficiencies. For example, uint64
's maximum value requires 10 bytes instead of the usual 8 (64/7 ≈ 10
).
Encoding Negative Numbers: Zigzag Encoding
Though varint
is efficient, it doesn’t account for negative numbers. In computing, numbers are stored as two’s complement, which means a small negative number might have a sizeable binary representation.
For example, -5
in 32-bit form is represented as 11111111111111111111111111111011
, requiring 5 bytes in varint encoding
Go uses zigzag encoding to solve this problem:
- For positive numbers
n
, map them to2n
. - For negative numbers
-n
, map them to2n-1
. This way, positive and negative numbers alternate without conflict, hence the namezigzag encoding
. For example, after zigzag encodingint32(-5)
, the value becomes 9 (00000000000000000000000000001001
), whichvarint
can represent with just 1 byte.
Here’s the Golang implementation:
list3: go src Putvarint
func Putvarint(buf []byte, x int64) int {
ux := uint64(x) << 1
if x < 0 {
ux = ^ux
}
return PutUvarint(buf, ux)
}
From the code, we can see that for the implementation of varint
for signed integers, the Go standard library breaks it down into two steps:
- First, the integer is converted using
ZigZag encoding
. - Then, the converted value is encoded using
varint
.
For negative numbers, there is an extra step: ux = ^ux
. This part might be confusing—why does this transformation result in 2n - 1
?
We can roughly deduce the process, assuming we have an integer -n
:
- First, the original value is shifted left, then inverted. This can be viewed as: first invert the value, then shift left, and finally add 1. This results in
2*(~(-n)) + 1
. - the two’s complement of a negative number is the bitwise inversion of its absolute value plus 1. So, how do we derive the absolute value from the two’s complement? There is a formula:
|A| = ~A + 1
. - Substituting this formula into the first step:
2*(n - 1) + 1 = 2n - 1
. This perfectly matches the ZigZag encoding for negative numbers (mathematics is indeed excellent).
In the Go standard library, calling PutUvarint
only applies varint
encoding, while calling PutVarint
first applies ZigZag encoding and then varint
encoding.
In protobuf
, if the type is int32
, int64
, uint32
, or uint64
, only varint
encoding is used. However, for sint32
and sint64
, ZigZag encoding is applied first, followed by varint
encoding.
When varint
Is Not Suitable
Despite its benefits, varint
isn't ideal for all scenarios:
-
Large integers:
varint
can be less efficient than fixed-length encoding for huge numbers. -
Random data access: Since
varint
uses variable lengths, indexing specific integers directly is challenging. -
Frequent mathematical operations:
varint-encoding
data requires decoding before operations, potentially affecting performance. -
Security-sensitive applications:
varint encoding
may leak information about the original integer's size, which could be unacceptable in secure environments.