Variable-Length Integers
From CodeCodex
There is often a need to pass integer values through a bytestream—for instance, over a network connection, or writing to and reading from a file. This can be done as a sequence of ASCII decimal digits, but sometimes a more compact binary representation is desirable. It can also be useful if the representation is not fixed-length; this way, smaller values can be encoded to take up less space, while larger values can still be represented as needed.
Existing Methods
One common technique is that used in Standard MIDI files. In this format, the bits of the integer are split into groups of 7, each encoded into a byte, where the eighth bit is set in every byte except the last.
Another technique is UTF-8. In this one, the first byte encodes the number of bytes that follow, together with the most significant few bits of the integer.
Both techniques have redundant encodings—that is, it is possible to encode some or all representable integer values in more than one way. UTF-8 adds an extra rule to declare these redundant encodings illegal, but they still exist.
The New Technique
The technique proposed here is a variation of that used in Standard MIDI files. It involves subtracting an offset every time a group of 7 bits is shifted out for encoding in another byte, and adding this back in again on decoding, so that there are no longer any redundant encodings: each integer is representable in exactly one way, and each representation stands for exactly one integer.
The following Python code implements the encoding and decoding algorithm:
class varint :
"""variable-length unsigned integer encoding"""
@staticmethod
def encode(N, Put) :
"""encodes integer N into variable-length format and
feeds the unsigned integer byte values to Put."""
while True :
ThisByte = N & 127
N >>= 7
if N == 0 :
Put(ThisByte)
break
#end if
Put(ThisByte | 128)
N -= 1
#end while
#end encode
@staticmethod
def decode(Get) :
"""reads a sequence of unsigned integer byte values by calling
Get, decodes them into an integer in variable-length format
and returns it as the function result."""
N = 0
Shift = 1
while True :
ThisByte = Get()
N += (ThisByte & 127) * Shift
if (ThisByte & 128) == 0 :
break
Shift <<= 7
N += Shift
#end while
return \
N
#end decode
#end varint
Note this code is written to use callbacks that implement the actual reading and writing of the bytestream, to avoid any assumptions about how exactly the stream is implemented. For instance, here's a simple “string-stream” class that lets you accumulate output to a string, or extract input from a string:
class string_stream :
"""I/O to/from a list of characters."""
def __init__(self, Initial = "") :
self.Contents = list(Initial)
#end __init__
def Get(self) :
"""returns the next available byte from the string."""
if len(self.Contents) == 0 :
raise IOError("get past end of string")
#end if
Ch = ord(self.Contents[0])
del self.Contents[0]
# not designed for efficiency, but what the hell
return \
Ch
#end Get
def Put(self, Ch) :
"""appends another byte to the string."""
self.Contents.append(chr(Ch))
#end Put
def AssertEmpty(self) :
"""asserts that the string has no contents."""
if len(self.Contents) != 0 :
raise IOError("contents not empty")
#end if
#end AssertEmpty
#end string_stream
And here's how you would use this to test the varint class:
for ThisInt in \
range(0, 3) + range(126, 131) \
:
Stream = string_stream()
varint.encode(ThisInt, Stream.Put)
Encoded = str(Stream)
ThatInt = varint.decode(Stream.Get)
Stream.AssertEmpty()
if ThisInt == ThatInt :
Msg = ""
else :
Msg = " (WRONG)"
#end if
sys.stdout.write \
(
"%d encodes to %s decodes to %d%s\n"
%
(ThisInt, Encoded, ThatInt, Msg)
)
#end for