Variable-Length Integers

From CodeCodex

There is often a need to pass integer values through a bytestream—for instance, over a network connection, or writing to and reading from a file. This can be done as a sequence of ASCII decimal digits, but sometimes a more compact binary representation is desirable. It can also be useful if the representation is not fixed-length; this way, smaller values can be encoded to take up less space, while larger values can still be represented as needed.

Existing Methods[edit]

One common technique is that used in Standard MIDI files. In this format, the bits of the integer are split into groups of 7, each encoded into a byte, where the eighth bit is set in every byte except the last.

Another technique is UTF-8. In this one, the first byte encodes the number of bytes that follow, together with the most significant few bits of the integer.

Both techniques have redundant encodings—that is, it is possible to encode some or all representable integer values in more than one way. UTF-8 adds an extra rule to declare these redundant encodings illegal, but they still exist.

The New Technique[edit]

The technique proposed here is a variation of that used in Standard MIDI files. It involves subtracting an offset every time a group of 7 bits is shifted out for encoding in another byte, and adding this back in again on decoding, so that there are no longer any redundant encodings: each integer is representable in exactly one way, and each representation stands for exactly one integer.

The following Python code implements the encoding and decoding algorithm:

class varint :
    """variable-length unsigned integer encoding"""

    def encode(N, Put) :
        """encodes integer N into variable-length format and
        feeds the unsigned integer byte values to Put."""
        while True :
            ThisByte = N & 127
            N >>= 7
            if N == 0 :
            #end if
            Put(ThisByte | 128)
            N -= 1
        #end while
    #end encode

    def decode(Get) :
        """reads a sequence of unsigned integer byte values by calling
        Get, decodes them into an integer in variable-length format
        and returns it as the function result."""
        N = 0
        Shift = 1
        while True :
            ThisByte = Get()
            N += (ThisByte & 127) * Shift
            if (ThisByte & 128) == 0 :
            Shift <<= 7
            N += Shift
        #end while
        return \
    #end decode

#end varint

Note this code is written to use callbacks that implement the actual reading and writing of the bytestream, to avoid any assumptions about how exactly the stream is implemented. For instance, here's a simple “string-stream” class that lets you accumulate output to a string, or extract input from a string:

class string_stream :
    """I/O to/from a list of characters."""

    def __init__(self, Initial = "") :
        self.Contents = list(Initial)
    #end __init__

    def Get(self) :
        """returns the next available byte from the string."""
        if len(self.Contents) == 0 :
            raise IOError("get past end of string")
        #end if
        Ch = ord(self.Contents[0])
        del self.Contents[0]
        # not designed for efficiency, but what the hell
        return \
    #end Get

    def Put(self, Ch) :
        """appends another byte to the string."""
    #end Put

    def AssertEmpty(self) :
        """asserts that the string has no contents."""
        if len(self.Contents) != 0 :
            raise IOError("contents not empty")
        #end if
    #end AssertEmpty

#end string_stream

And here's how you would use this to test the varint class:

for ThisInt in \
    range(0, 3) + range(126, 131) \
    Stream = string_stream()
    varint.encode(ThisInt, Stream.Put)
    Encoded = str(Stream)
    ThatInt = varint.decode(Stream.Get)
    if ThisInt == ThatInt :
        Msg = ""
    else :
        Msg = " (WRONG)"
    #end if
    sys.stdout.write \
            "%d encodes to %s decodes to %d%s\n"
            (ThisInt, Encoded, ThatInt, Msg)
#end for