Saturday, January 17, 2015

Binary-file Grammar

Binary-file Grammar

This is going to be a bit of a specific post about a filetype that only a small subset of mostly electron microscopists use. Like all file formats for important data, not being able to at least read them outside of a proprietary program is the kind of thing that keeps researchers awake at night. Also, in trying to find an elegant way to both read and write the files, I ended up designing a more general binary-file grammar and parser that (maybe) could be expanded to other filetypes.

I've actually written a few dm3 file parsers over the years, for some reason I quite enjoy reverse engineering file formats so I haven't minded repeating myself so much. Especially as each time I think 'This'll be the time I finally find an elegant solution for doing this'. Each time I start with the best intentions, a few functions, lots of code reuse, but then by the time the edge cases are sorted out, I'm left with a mess of functions that I have to teach myself again each time I see the code.

So slightly inspired by the amazingly simple grammar parser in the udacity design of computer programs course, and slightly dissuaded by the more complete but complicated monadic parsing, I decided to give it yet another go, but this time starting with a (almost) human readable description of the file that would be the only thing the parser then needed. I wanted something that served as both a source for the parser, and a concise description of how the file is actually structured. The problem is then split into two (or three) parts: 1. Writing a general grammar parser and then 2. writing a grammar for the specific filetype I want to read and write (and then probably another step 3. converting from the syntax tree produced by the parser into a more usable format).


The grammar consists of definitions with each one made up of other definitions or else a struct type that can be read straight from the file (it's actually slightly more complicated than this, and can also be something that evaluates to the name of a definition, or an array of names of definitions). Here's a simple example of the grammar:

atom: len(<l)=0
atom: len(<l), string({len}s), atom 

It describes a single element, atom, that can be stored in two different ways, both starting with the '<l' type (from python struct class, a little-endian 32-bit int). If it's equal to zero it matches the first atom description and ends there. If non-zero it is followed by that number of characters, followed by another atom element. In both cases the length parameter is read into a variable called len, and subsequent elements get format'ed with any preceding values. The output from the parser (or input if writing) would be something like

{'len': 5, 'string': 'Hello' atom: 
    {'len': 6, 'string': 'World!', atom: 
        {len: 0}}}

(I'm using definitions with identical names as a way to provide options to the parser. The other (probably better) way would be to use, say:

zeroatom: len(<l)=0
nonzeroatom: len(<l), string({len}s), zeroatom | nonzeroatom

Then the two types would be differentiated in the tree returned, rather than having to know the difference between the two types to know which one you have. But I'm sticking with this way as it works and makes lists of heterogeneous data slightly easier.)

There are two magic values that get inserted automatically by the parser: 'parent', which contains the values from the parent of the current item, and 'f' which gives access to the python file-object (it's currently only used as 'f.tell()' to find the current file position, so this could be replaced with just 'position' in the future). 

DM3 Grammar

The DM3 grammar I've created is a lot more complicated than the example above. A description of the filetype can be found here, and I hope that the grammar itself can provide the information almost as easily.

Here's the grammar for the DM3 files:

dm3_grammar = """
header:     version(>l)=3, len(>l), endianness(>l)=1, _pos=f.tell(), section, 
            len=f.tell()-_pos+4, zero_pad_0(>l)=0, zero_pad_1(>l)=0

section:    is_dict(b), open(b), num_tags(>l), data(["named_data"]*num_tags)

named_data: sdtype(b)=20, name_length(>H), name({name_length}s), 

named_data: sdtype(b)=20, name_length(>H), name({name_length}s), 

# struct-specific data entry
dataheader: delim(4s)="%%%%", headerlen(>l), _pos=f.tell(),
            dtype(>l)=15, struct_header, 
            headerlen=(f.tell()-_pos)/4, struct_data

# array-specific data entry
dataheader: delim(4s)="%%%%", headerlen(>l), _pos=f.tell(),
            dtype(>l)=20, array_data,

# simple data
dataheader: delim(4s)="%%%%", headerlen(>l), _pos=f.tell(), 
            headerlen=(f.tell()-_pos)/4, data(simpledata_{dtype})

simpledata_2 = h
simpledata_3 = i
simpledata_4 = H
simpledata_5 = I
simpledata_6 = f
simpledata_7 = d
simpledata_8 = b
simpledata_9 = b
simpledata_10 = b
simpledata_11 = q
simpledata_12 = Q

struct_header: length(>l)=0, num_fields(>l),

struct_data: data([("simpledata_%s" % dtypes.dtype) for dtypes in parent.struct_header.types])

struct_dtype: length(>l)=0, dtype(>l)

array_data: arraydtype(>l)=15, struct_header, len(>l),
            _end=f.tell(), array(["struct_data"]*len)

#general case:
array_data: arraydtype(>l), len(>l),
            _end=f.tell(), array("{len}"+simpledata_{arraydtype})


This has a few features not described in the simple example above. Constants can be defined using name=value, as is done for the simpledata_x values. Items beginning with an underscore, '_', are treated as variables, with the value on the right hand side being evaluated and stored in the variable. It's also possible to refer to an already defined element and either confirm the value is what is expected (when reading, or when writing and a value is specified) or else have it be set to the expected value (when writing and no value is provided). For example the 'header' atom has the following:

..., len(>l), endianness(>l)=1, _pos=f.tell(), section, len=f.tell()-_pos+4, ...

In this example, the first definition of len tells the parser the location and type of  len, then a bit later f.tell() is evaluated and put into _pos, and then later len is checked to be equal to f.tell() at a later point, (minus _pos plus 4). When writing, if len is not specified, the same process takes place only this time len is calculated from the expected value and written back into the location of len when it gets calculated. This allows the parser to automatically fill in values to their expected values when writing, making things a lot easier.

One thing it does which might give grammar designers nightmares, is allow access to higher elements in the tree too. The struct_data item needs parent.struct_header.types to exist ('parent' is automatically set by the parser each time it parses a new item) which it luckily is, in the two times we need it.

Conclusion (and DM4 files)

So is it worth going to all this trouble (writing a parser, and designing a grammar, then writing a version of it for the file and then having to massage the data out of a syntax-tree etc, rather than just parsing the thing manually)? I'm not totally sure, but I do feel as though if I ever need to remind myself how the file is structured I will refer to the grammar first as a comprehensive description of the file structure.

Also, the DM3 grammar I've shown above isn't the version I'm using now. There's also a DM4 format whose main purpose is to support 64 bit sizes. I'm actually converting a general grammar into two specific versions for both DM3 and DM4 files which means I can support both with only very minimal changes. I find and replace some fields in the general grammar as follows:

dm3_grammar_defs = { 'version': 3,
                     'len_type': '>l',
                     'len_size': 4,
                     'header_offset': 4,
                     'named_data_pre': '',
                     'named_data_post': '',

dm4_grammar_defs = { 'version': 4,
                     'len_type': '>Q',
                     'len_size': 8,
                     'header_offset': 0,
                     'named_data_pre': 'datalen(>Q), _pos=f.tell(), ',
                     'named_data_post': ', datalen=f.tell()-_pos',

This I think really shows the flexibility of using a grammar. With a general grammar and a small number of changes it's possible to describe two different filetypes and provide reading and writing capabilities for those files.

I've split this project into two projects on GitHub. There's a FileGrammar and then a separate DM3Utils project that depends upon it.

No comments:

Post a Comment