Saturday, January 17, 2015

Binary-file Grammar

Binary-file Grammar

This is going to be a bit of a specific post about a filetype that only a small subset of mostly electron microscopists use. Like all file formats for important data, not being able to at least read them outside of a proprietary program is the kind of thing that keeps researchers awake at night. Also, in trying to find an elegant way to both read and write the files, I ended up designing a more general binary-file grammar and parser that (maybe) could be expanded to other filetypes.

I've actually written a few dm3 file parsers over the years, for some reason I quite enjoy reverse engineering file formats so I haven't minded repeating myself so much. Especially as each time I think 'This'll be the time I finally find an elegant solution for doing this'. Each time I start with the best intentions, a few functions, lots of code reuse, but then by the time the edge cases are sorted out, I'm left with a mess of functions that I have to teach myself again each time I see the code.

So slightly inspired by the amazingly simple grammar parser in the udacity design of computer programs course, and slightly dissuaded by the more complete but complicated monadic parsing, I decided to give it yet another go, but this time starting with a (almost) human readable description of the file that would be the only thing the parser then needed. I wanted something that served as both a source for the parser, and a concise description of how the file is actually structured. The problem is then split into two (or three) parts: 1. Writing a general grammar parser and then 2. writing a grammar for the specific filetype I want to read and write (and then probably another step 3. converting from the syntax tree produced by the parser into a more usable format).


The grammar consists of definitions with each one made up of other definitions or else a struct type that can be read straight from the file (it's actually slightly more complicated than this, and can also be something that evaluates to the name of a definition, or an array of names of definitions). Here's a simple example of the grammar:

atom: len(<l)=0
atom: len(<l), string({len}s), atom 

It describes a single element, atom, that can be stored in two different ways, both starting with the '<l' type (from python struct class, a little-endian 32-bit int). If it's equal to zero it matches the first atom description and ends there. If non-zero it is followed by that number of characters, followed by another atom element. In both cases the length parameter is read into a variable called len, and subsequent elements get format'ed with any preceding values. The output from the parser (or input if writing) would be something like

{'len': 5, 'string': 'Hello' atom: 
    {'len': 6, 'string': 'World!', atom: 
        {len: 0}}}

(I'm using definitions with identical names as a way to provide options to the parser. The other (probably better) way would be to use, say:

zeroatom: len(<l)=0
nonzeroatom: len(<l), string({len}s), zeroatom | nonzeroatom

Then the two types would be differentiated in the tree returned, rather than having to know the difference between the two types to know which one you have. But I'm sticking with this way as it works and makes lists of heterogeneous data slightly easier.)

There are two magic values that get inserted automatically by the parser: 'parent', which contains the values from the parent of the current item, and 'f' which gives access to the python file-object (it's currently only used as 'f.tell()' to find the current file position, so this could be replaced with just 'position' in the future). 

DM3 Grammar

The DM3 grammar I've created is a lot more complicated than the example above. A description of the filetype can be found here, and I hope that the grammar itself can provide the information almost as easily.

Here's the grammar for the DM3 files:

dm3_grammar = """
header:     version(>l)=3, len(>l), endianness(>l)=1, _pos=f.tell(), section, 
            len=f.tell()-_pos+4, zero_pad_0(>l)=0, zero_pad_1(>l)=0

section:    is_dict(b), open(b), num_tags(>l), data(["named_data"]*num_tags)

named_data: sdtype(b)=20, name_length(>H), name({name_length}s), 

named_data: sdtype(b)=20, name_length(>H), name({name_length}s), 

# struct-specific data entry
dataheader: delim(4s)="%%%%", headerlen(>l), _pos=f.tell(),
            dtype(>l)=15, struct_header, 
            headerlen=(f.tell()-_pos)/4, struct_data

# array-specific data entry
dataheader: delim(4s)="%%%%", headerlen(>l), _pos=f.tell(),
            dtype(>l)=20, array_data,

# simple data
dataheader: delim(4s)="%%%%", headerlen(>l), _pos=f.tell(), 
            headerlen=(f.tell()-_pos)/4, data(simpledata_{dtype})

simpledata_2 = h
simpledata_3 = i
simpledata_4 = H
simpledata_5 = I
simpledata_6 = f
simpledata_7 = d
simpledata_8 = b
simpledata_9 = b
simpledata_10 = b
simpledata_11 = q
simpledata_12 = Q

struct_header: length(>l)=0, num_fields(>l),

struct_data: data([("simpledata_%s" % dtypes.dtype) for dtypes in parent.struct_header.types])

struct_dtype: length(>l)=0, dtype(>l)

array_data: arraydtype(>l)=15, struct_header, len(>l),
            _end=f.tell(), array(["struct_data"]*len)

#general case:
array_data: arraydtype(>l), len(>l),
            _end=f.tell(), array("{len}"+simpledata_{arraydtype})


This has a few features not described in the simple example above. Constants can be defined using name=value, as is done for the simpledata_x values. Items beginning with an underscore, '_', are treated as variables, with the value on the right hand side being evaluated and stored in the variable. It's also possible to refer to an already defined element and either confirm the value is what is expected (when reading, or when writing and a value is specified) or else have it be set to the expected value (when writing and no value is provided). For example the 'header' atom has the following:

..., len(>l), endianness(>l)=1, _pos=f.tell(), section, len=f.tell()-_pos+4, ...

In this example, the first definition of len tells the parser the location and type of  len, then a bit later f.tell() is evaluated and put into _pos, and then later len is checked to be equal to f.tell() at a later point, (minus _pos plus 4). When writing, if len is not specified, the same process takes place only this time len is calculated from the expected value and written back into the location of len when it gets calculated. This allows the parser to automatically fill in values to their expected values when writing, making things a lot easier.

One thing it does which might give grammar designers nightmares, is allow access to higher elements in the tree too. The struct_data item needs parent.struct_header.types to exist ('parent' is automatically set by the parser each time it parses a new item) which it luckily is, in the two times we need it.

Conclusion (and DM4 files)

So is it worth going to all this trouble (writing a parser, and designing a grammar, then writing a version of it for the file and then having to massage the data out of a syntax-tree etc, rather than just parsing the thing manually)? I'm not totally sure, but I do feel as though if I ever need to remind myself how the file is structured I will refer to the grammar first as a comprehensive description of the file structure.

Also, the DM3 grammar I've shown above isn't the version I'm using now. There's also a DM4 format whose main purpose is to support 64 bit sizes. I'm actually converting a general grammar into two specific versions for both DM3 and DM4 files which means I can support both with only very minimal changes. I find and replace some fields in the general grammar as follows:

dm3_grammar_defs = { 'version': 3,
                     'len_type': '>l',
                     'len_size': 4,
                     'header_offset': 4,
                     'named_data_pre': '',
                     'named_data_post': '',

dm4_grammar_defs = { 'version': 4,
                     'len_type': '>Q',
                     'len_size': 8,
                     'header_offset': 0,
                     'named_data_pre': 'datalen(>Q), _pos=f.tell(), ',
                     'named_data_post': ', datalen=f.tell()-_pos',

This I think really shows the flexibility of using a grammar. With a general grammar and a small number of changes it's possible to describe two different filetypes and provide reading and writing capabilities for those files.

I've split this project into two projects on GitHub. There's a FileGrammar and then a separate DM3Utils project that depends upon it.

Tuesday, January 13, 2015


Melanosomes, uh?

While at a Magnetic Fields' concert a few years ago, one of the members of the band, as an example of just how rock'n'roll their lifestyle was, pointed out that the word exit is composed only of valid 2-letter scrabble words (ex, xi and it).

I've always been wondering if there were more words like this, mainly as a tool to help me remember the 105 two letter scrabble words. It's quite a simple task to do in Python, and the longest such words are denominator, kinetosomes, melanosomes and somatomedin. There's 1890 words in total (not including the two letter words themsleves) and a few more interesting words are sinusoidal, bonefishes, monoxides and tomatoes. The only two-letter word not to appear in any longer word is uh. Trying to find the shortest number of words needed to include all two-letter words is slightly harder, and the best I came up with was the following 30 words.


To be honest, it's probably easier to remember the 105.

It's also interesting to try the same thing with three-letter words. There's a lot more of them (997, with a lot less overlap -- I could only reduce them to 742 longer words) , but there are some interesting composite words, such as outskatedoperated and monorail.

There's also 509 words that act as both such as shadowed, honored and panamas.

The code for finding the words can be found on Github.