Simple Chunk Protocol Format

From CodeCodex

I have often had the need to set up custom communication protocols between client and server processes. I decided early on that I needed to settle on a metaformat that could be easily customized for a particular application, and also easily extended as the needs of that application evolved.

I decided to base this metaformat on the old IFF concept from the Commodore-Amiga days. IFF is built on the concept of chunks, and has some elaborate conventions (such as FORM chunks) for embedding chunks within other chunks to construct more complex structures. I didn't bother with these additional elaborations, though of course you can adopt them if you want.

A chunk begins with a fixed-length four-byte ID code. Usually this is four printable ASCII characters, with some mnemonic value indicating the purpose of the chunk. This is followed by a four-byte unsigned integer giving the length of the chunk contents (which can be 0), and then by the contents of the chunk itself. At the top level, the chunk ID is used to indicate the kind of request being made of the server (or the kind of response being returned to the client); depending on the need, the chunk contents could in turn be made up of sub-chunks containing various parameters of the request or response.

There are two other differences between my spec and the original IFF spec: chunks are not aligned on anything greater than a byte boundary, and chunk lengths are passed in little-endian byte order.
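To make the byte layout concrete, here is a minimal sketch, assuming Python 3 (where chunk data is bytes rather than str). It packs the 8-byte header of a hypothetical "DATA" chunk whose contents are 8 bytes long. The big-endian variant is shown only for contrast with the byte order IFF uses.

```python
import struct

# Header of a hypothetical "DATA" chunk whose contents are 8 bytes long.
# "<4sI" = little-endian, 4-byte ID string, 4-byte unsigned length,
# with no padding or alignment between the fields.
header = struct.pack("<4sI", b"DATA", 8)
print(header)        # b'DATA\x08\x00\x00\x00'
print(len(header))   # 8

# For contrast, the same header with a big-endian length, as in IFF.
iff_style = struct.pack(">4sI", b"DATA", 8)
print(iff_style)     # b'DATA\x00\x00\x00\x08'
```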

Python

The following chunk-manipulation routines are all defined as static methods of a utility class called chunk. First, a routine to construct a chunk, given the chunk ID and chunk contents:

import struct

class chunk :

    @staticmethod
    def make(ID, Contents) :
        """constructs a chunk with the specified ID and contents."""
        return \
                struct.pack("<4sI", ID[0:4], len(Contents)) \
            + \
                str(Contents)
    #end make

And a routine to do the reverse, parsing a chunk into its ID and contents, and also whatever leftover bytes may come after:

# continuation of class chunk :

    @staticmethod
    def extract(Data) :
        """parses a chunk into its ID, contents, and whatever comes after.
        Returns None if Data cannot be parsed."""
        if len(Data) >= 8 :
            (ID, Len) = struct.unpack("<4sI", Data[0 : 8])
            if Len <= len(Data) - 8 :
                Result = (ID, Data[8 : Len + 8], Data[Len + 8:])
            else :
                Result = None
            #end if
        else :
            Result = None
        #end if
        return Result
    #end extract

Note that, if the content is too short for the specified length, chunk.extract returns None. This kind of validation becomes important when you're receiving data via a network over which you have no control.

A simple example use of the above routines:

>>> c = chunk.make("DATA", "Hi There")
>>> c
'DATA\x08\x00\x00\x00Hi There'
>>> chunk.extract(c)
('DATA', 'Hi There', '')
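
The session above appears to rely on Python 2, where str holds raw bytes. As a rough sketch of how the same two routines might look under Python 3, assuming IDs and contents are passed as bytes, the following module-level functions also show the None result for truncated input. The names make_chunk and extract_chunk are chosen here for illustration and are not part of the original class.

```python
import struct

HEADER = struct.Struct("<4sI")  # 4-byte ID, little-endian 4-byte length

def make_chunk(chunk_id, contents):
    """Builds a chunk from a bytes ID and bytes contents."""
    return HEADER.pack(chunk_id[:4], len(contents)) + contents

def extract_chunk(data):
    """Returns (ID, contents, leftover), or None if data is incomplete."""
    if len(data) < HEADER.size:
        return None
    chunk_id, length = HEADER.unpack(data[:HEADER.size])
    end = HEADER.size + length
    if end > len(data):
        return None
    return (chunk_id, data[HEADER.size:end], data[end:])

c = make_chunk(b"DATA", b"Hi There")
print(extract_chunk(c))        # (b'DATA', b'Hi There', b'')
print(extract_chunk(c[:-1]))   # None: one byte of contents is missing
print(extract_chunk(c[:5]))    # None: the header itself is incomplete
```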

Why does chunk.extract return the leftover bytes following the chunk? So that you can easily construct and parse chunks made up of sequences of chunks. The final routine will decode such a chunk sequence:

# continuation of class chunk :

    @staticmethod
    def extract_sequence(Data) :
        """parses Data into a sequence of pairs of chunk IDs and
        contents."""
        Result = []
        while True :
            Items = chunk.extract(Data)
            if Items is None :
                break
            Result.append([Items[0], Items[1]])
            Data = Items[2]
        #end while
        return Result
    #end extract_sequence

#end chunk

An example use:

>>> c = chunk.make("COPY", chunk.make("FROM", "here") + chunk.make("TO  ", "there"))
>>> c
'COPY\x19\x00\x00\x00FROM\x04\x00\x00\x00hereTO  \x05\x00\x00\x00there'
>>> chunk.extract(c)
('COPY', 'FROM\x04\x00\x00\x00hereTO  \x05\x00\x00\x00there', '')
>>> chunk.extract_sequence(chunk.extract(c)[1])
[['FROM', 'here'], ['TO  ', 'there']]

Usually, the order of the chunks is not important, since they are identified by ID. To find the sub-chunks by ID, turn the result into a dictionary:

>>> d = dict(chunk.extract_sequence(chunk.extract(c)[1]))
>>> d['FROM']
'here'
>>> d['TO  ']
'there'
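
Putting the pieces together, a server might use the top-level chunk ID to pick a handler and pass it the sub-chunks as a dictionary. The following is only a sketch under stated assumptions: it targets Python 3 (bytes instead of str), extract and extract_sequence are module-level ports of the class methods above, and handle_copy, HANDLERS and dispatch are hypothetical names introduced for illustration.

```python
import struct

def extract(data):
    """Python 3 sketch of chunk.extract: (ID, contents, leftover) or None."""
    if len(data) < 8:
        return None
    chunk_id, length = struct.unpack("<4sI", data[:8])
    if length > len(data) - 8:
        return None
    return (chunk_id, data[8:8 + length], data[8 + length:])

def extract_sequence(data):
    """Python 3 sketch of chunk.extract_sequence."""
    result = []
    while True:
        items = extract(data)
        if items is None:
            break
        result.append((items[0], items[1]))
        data = items[2]
    return result

def handle_copy(params):
    # Hypothetical handler: just reports what it would do.
    return b"copy " + params[b"FROM"] + b" -> " + params[b"TO  "]

HANDLERS = {b"COPY": handle_copy}   # hypothetical request table

def dispatch(request):
    """Looks up the handler for a top-level chunk and passes it the
    sub-chunks as a dictionary. Returns None if the request cannot be
    parsed or its ID is unknown."""
    parsed = extract(request)
    if parsed is None or parsed[0] not in HANDLERS:
        return None
    return HANDLERS[parsed[0]](dict(extract_sequence(parsed[1])))

request = (b"COPY\x19\x00\x00\x00"
           b"FROM\x04\x00\x00\x00here"
           b"TO  \x05\x00\x00\x00there")
print(dispatch(request))   # b'copy here -> there'
```

Note that converting to a dictionary keeps only the last sub-chunk for any repeated ID, so this approach suits requests in which each parameter appears at most once.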

Receiving A Chunk

To illustrate reading a chunk over a network connection, first I define a utility routine that reads exactly the specified number of bytes from a socket, looping until they have all arrived:

def receive_all(from_socket, n) :
    """reads n bytes from from_socket, raising an exception if
    EOF reached."""
    result = ""
    while n > 0 :
        # recv may return fewer than n bytes, so keep asking for the rest
        data = from_socket.recv(n)
        if len(data) == 0 :
            # the other end closed the connection before n bytes arrived
            raise IOError("EOF on socket")
            # or replace the raise with "return None" to signal EOF that way
        #end if
        result += data
        n -= len(data)
    #end while
    return result
#end receive_all

Now, assuming a network connection socket is opened and stored in the variable the_connection, reading a chunk is as simple as:

header = receive_all(the_connection, 8)
the_complete_chunk = header + receive_all \
  (
    the_connection,
    struct.unpack("<4sI", header)[1]
  )
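
The snippet above trusts whatever length arrives in the header. Since the data may come from a network you do not control, it can be worth refusing implausibly large lengths before trying to read the contents. A minimal sketch of that idea follows, assuming Python 3 (bytes rather than str); receive_chunk and the MAX_CHUNK_LEN limit are hypothetical names and values introduced here, and the demonstration uses a local socket pair in place of a real connection.

```python
import socket
import struct

MAX_CHUNK_LEN = 1 << 20   # hypothetical limit; pick one that suits the application

def receive_all(from_socket, n):
    """Reads exactly n bytes from from_socket, raising an exception on EOF."""
    parts = []
    while n > 0:
        data = from_socket.recv(n)
        if len(data) == 0:
            raise IOError("EOF on socket")
        parts.append(data)
        n -= len(data)
    return b"".join(parts)

def receive_chunk(from_socket, max_len=MAX_CHUNK_LEN):
    """Reads one chunk and returns (ID, contents), refusing oversized lengths."""
    chunk_id, length = struct.unpack("<4sI", receive_all(from_socket, 8))
    if length > max_len:
        raise ValueError("chunk length %d exceeds limit %d" % (length, max_len))
    return (chunk_id, receive_all(from_socket, length))

# Demonstration over a connected pair of local sockets.
a, b = socket.socketpair()
a.sendall(struct.pack("<4sI", b"DATA", 8) + b"Hi There")
print(receive_chunk(b))   # (b'DATA', b'Hi There')
a.close()
b.close()
```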

Of course, this may be adequate for a single-threaded client which can block until the data is received, but a server will typically need to be written to handle requests coming in from multiple clients simultaneously. For the moment, writing code to handle such situations will be left as an exercise for the reader. ☻
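
As one possible starting point for that exercise, and not a complete server, a program that cannot block might keep a buffer per connection, append whatever bytes each recv returns, and peel off chunks as they become complete. The sketch below assumes Python 3; the ChunkBuffer class and its feed method are hypothetical names introduced here, and the socket handling itself is left out.

```python
import struct

class ChunkBuffer:
    """Hypothetical per-connection buffer that accumulates received bytes
    and hands back chunks once they are complete."""

    def __init__(self):
        self.pending = b""

    def feed(self, data):
        """Appends newly received bytes and returns a list of
        (ID, contents) pairs for every chunk that is now complete."""
        self.pending += data
        complete = []
        while len(self.pending) >= 8:
            chunk_id, length = struct.unpack("<4sI", self.pending[:8])
            if len(self.pending) - 8 < length:
                break   # contents not fully received yet
            complete.append((chunk_id, self.pending[8:8 + length]))
            self.pending = self.pending[8 + length:]
        return complete

buf = ChunkBuffer()
wire = struct.pack("<4sI", b"DATA", 8) + b"Hi There"
print(buf.feed(wire[:5]))     # []
print(buf.feed(wire[5:12]))   # []
print(buf.feed(wire[12:]))    # [(b'DATA', b'Hi There')]
```

Each connection would own one such buffer, so data arriving in arbitrary pieces from several clients can be interleaved without any single read having to wait for a whole chunk.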