There is often a need to pass integer values through a bytestream—for instance, over a network connection, or writing to and reading from a file. This can be done as a sequence of ASCII decimal digits, but sometimes a more compact binary representation is desirable. It can also be useful if the representation is not fixed-length; this way, smaller values can be encoded to take up less space, while larger values can still be represented as needed.
One common technique is that used in Standard MIDI files. In this format, the bits of the integer are split into groups of 7, each encoded into a byte, where the eighth bit is set in every byte except the last.
Another technique is UTF-8. In this one, the first byte encodes the number of bytes that follow, together with the most significant few bits of the integer.
Both techniques have redundant encodings—that is, it is possible to encode some or all representable integer values in more than one way. UTF-8 adds an extra rule to declare these redundant encodings illegal, but they still exist.
The New Technique
The technique proposed here is a variation of that used in Standard MIDI files. It involves subtracting an offset every time a group of 7 bits is shifted out for encoding in another byte, and adding this back in again on decoding, so that there are no longer any redundant encodings: each integer is representable in exactly one way, and each representation stands for exactly one integer.
The following Python code implements the encoding and decoding algorithm:
class varint : """variable-length unsigned integer encoding""" @staticmethod def encode(N, Put) : """encodes integer N into variable-length format and feeds the unsigned integer byte values to Put.""" while True : ThisByte = N & 127 N >>= 7 if N == 0 : Put(ThisByte) break #end if Put(ThisByte | 128) N -= 1 #end while #end encode @staticmethod def decode(Get) : """reads a sequence of unsigned integer byte values by calling Get, decodes them into an integer in variable-length format and returns it as the function result.""" N = 0 Shift = 1 while True : ThisByte = Get() N += (ThisByte & 127) * Shift if (ThisByte & 128) == 0 : break Shift <<= 7 N += Shift #end while return \ N #end decode #end varint
Note this code is written to use callbacks that implement the actual reading and writing of the bytestream, to avoid any assumptions about how exactly the stream is implemented. For instance, here's a simple “string-stream” class that lets you accumulate output to a string, or extract input from a string:
class string_stream : """I/O to/from a list of characters.""" def __init__(self, Initial = "") : self.Contents = list(Initial) #end __init__ def Get(self) : """returns the next available byte from the string.""" if len(self.Contents) == 0 : raise IOError("get past end of string") #end if Ch = ord(self.Contents) del self.Contents # not designed for efficiency, but what the hell return \ Ch #end Get def Put(self, Ch) : """appends another byte to the string.""" self.Contents.append(chr(Ch)) #end Put def AssertEmpty(self) : """asserts that the string has no contents.""" if len(self.Contents) != 0 : raise IOError("contents not empty") #end if #end AssertEmpty #end string_stream
And here's how you would use this to test the varint class:
for ThisInt in \ range(0, 3) + range(126, 131) \ : Stream = string_stream() varint.encode(ThisInt, Stream.Put) Encoded = str(Stream) ThatInt = varint.decode(Stream.Get) Stream.AssertEmpty() if ThisInt == ThatInt : Msg = "" else : Msg = " (WRONG)" #end if sys.stdout.write \ ( "%d encodes to %s decodes to %d%s\n" % (ThisInt, Encoded, ThatInt, Msg) ) #end for