Thursday, November 12, 2015

Pascal's Triangle and Fourier Transforms

One of the really useful properties of the Fourier transform is that convolution in one domain becomes multiplication in the other and vice-versa.
But it's also interesting that the way we multiply regular base-10 numbers together is really a form of convolution. The decimal number ABC multiplied by DEF can be calculated as C*(DEF) + B0*(DEF) + A00*(DEF), which is essentially convolving ABC with DEF (as this paper points out). And if multiplication is a convolution, then we can convert to the Fourier domain and use multiplication to perform our convolution to perform our multiplication!
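
As a quick sanity check, convolving the digit lists directly (here with numpy's convolve, before bringing in any Fourier transforms; digits stored little-endian as in the code below, and the numbers are just an illustration):

import numpy as np

a = [3, 2, 1]  # 123, least-significant digit first
b = [6, 5, 4]  # 456
conv = np.convolve(a, b)  # [18, 27, 28, 13, 4]
# Weighting each entry by its power of 10 resolves the carries:
print(sum(int(c) * 10**i for i, c in enumerate(conv)))  # 56088 == 123 * 456
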
Here’s an example in python that takes two lists of digits and returns a new list representing the two numbers multiplied together:
import numpy as np
fft = np.fft.fft
ifft = np.fft.ifft

def mult_lists(a, b):
    """
    Multiply the lists of digits a, b by convolving a with b via a
    Fourier transform.
    a and b should have units in the first element, then base**1,
    base**2, etc. Eg 3412 (in base 10) would be [2, 1, 4, 3].
    """
    # Pad both inputs to the full output length so that the circular
    # convolution the FFT performs is equivalent to a linear one.
    out_size = len(a) + len(b)
    a = a + [0] * (out_size - len(a))
    b = b + [0] * (out_size - len(b))
    # Multiplication in the Fourier domain is convolution in the
    # original domain; the result is real up to rounding error.
    return ifft(fft(a) * fft(b)).real
To use this, we can write some helper functions to convert numbers to and from digit lists, and we then have a bigint multiplication routine:
import random

def to_digit_list(num, base=10):
    ret = []
    while num > 0:
        ret.append(num % base)
        num //= base
    return ret

def from_digit_list(nums, base=10):
    return sum(int(round(num)) * base**i for i,num in enumerate(nums))

def fourier_mult(a, b):
    return from_digit_list(mult_lists(to_digit_list(a), to_digit_list(b)))

a = random.getrandbits(256)
b = random.getrandbits(256)
ab = a * b
abf = fourier_mult(a, b)
if ab == abf:
    print("Numbers are the same!")
There's something else we can do with this. We can think of each row of Pascal's Triangle as the previous row convolved with (1, 1). This gives us an easy way to calculate a row of Pascal's triangle, using Fourier transforms to do the convolution for us:
def pascal_row(n):
    # Pad [1, 1] to length n + 1: (1 + x)**n has n + 1 coefficients,
    # so the FFT's circular convolution matches the linear one.
    x = [1, 1] + [0] * (n - 1)

    fx = fft(x)
    return np.round(ifft(fx ** n).real)
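
For example (the floats are a consequence of the inverse FFT):

>>> pascal_row(4)
array([1., 4., 6., 4., 1.])
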
Rounding errors mean that only rows up to 50 or so get calculated accurately.
It’s interesting that Wikipedia talks about obtaining an alternating-sign row of Pascal’s triangle by taking the Fourier transform of sin(x)^n / x. I expect finding the right combinations of convolutions and multiplications might be able to convert between the two expressions.
[I spent some time trying this, but analytically finding something whose inverse Fourier transform was [1, 1, 0…] turned out not to be that straightforward - a sinc function corresponding to a top-hat of width two, shifted by half a pixel, didn't give me a particularly clean result and started giving errors even for the very first rows of Pascal's triangle. I put my attempt on github just in case anyone's slightly interested…]

Sunday, April 19, 2015

Platonic Katamaris

Way back when SketchUp first came out (right after Google bought it), I was thinking of things that might be fun to model and add to the SketchUp warehouse. I really enjoyed the quick and simple way that SketchUp allowed you to make models, and I was also interested in coming up with something that could be created programmatically, as it would allow me to dip my toes into Ruby (SketchUp's script language of choice).

I'd also been playing Katamari Damacy for a while, and had been really enjoying it. It's a strange game (it's sometimes described more as a 'toy' than a 'game') and basically involves rolling things up, which makes you bigger, allowing you to roll up even bigger things, etc. Combine this with an amazing soundtrack and a quirky, almost oedipal, father-son relationship and it makes for an entertaining experience. So why not model a Katamari, a sticky ball which looks something like this:

It looks simple enough: a sphere with a few small spheres placed around it. But how do you place the small spheres regularly around a sphere? I thought about spacing them regularly on a circumference of the sphere, and maybe choosing a few other circumferences at equally spaced angles, but this doesn't really look like how they're positioned above. A better way to regularly position the spheres is to place them on the faces of a regular polyhedron. This way the Katamari will look the same from all angles and the spheres will all be evenly spaced over the sphere. A cube would only give 6 sub-spheres, which is too few, and an octahedron only 8, which still seems too few, so a dodecahedron, with 12 faces, seems like a good choice (in fact, in the image above I just noticed we can see 3 full spheres and 6 half spheres, suggesting there are 12 in total).

So how to create a dodecahedron in SketchUp? A dodecahedron consists of 12 regularly placed pentagons. Creating the pentagons is easy, but how to combine 12 of them into a dodecahedron? What is the angle between two sides of a dodecahedron? Obviously the easy way to do this is to find some obscure website that could maybe have information like that, but it was annoying me slightly that I couldn't work it out from first principles. Surely there must be a way to calculate this from scratch using just basic mathematics? There is, and don't call me Surely.

Let's say we just have two pentagons, laying flat next to each other, as shown below:
To make a dodecahedron we need to rotate B around its shared edge with A until the angle the adjacent sides make equals half the outer angle at their shared point. In the image above I've drawn this angle for the tip of A - a vertical line, by symmetry, will split the angle around the tip of A in half. So in this case we would rotate B to B', where its edge closest to the vertical line becomes vertical (imagine in this case that B' is rotated out of the screen with its tip pointing towards you):

We can find this angle using simple trigonometry. We will be at the correct angle when tan(b) = -tan(a) * cos(c):


So the dihedral angle, c, is arccos(-tan(b) / tan(a)) where (in degrees, and with N=5 here) a = 180 - 360/N and b = (180 + 360/N)/2. This can be simplified to arccos(1 - 1 / (2 * sin(pi/N)^2)) (and I'm sure probably further, but I stopped there).

Interestingly, this function works for all platonic solids with Schläfli symbol {p, q} where q is 3 (ie the tetrahedron, cube and dodecahedron). On Wikipedia, they use 2*arcsin(cos(pi/3)/sin(pi/p)) for this angle, which I'm guessing is the same thing. Here is a plot of the function from 2 to 6.
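
The two expressions are easy to check numerically; here's a minimal sketch (the known dihedral angles are roughly 70.53°, 90° and 116.57°):

import math

for p, name in [(3, 'tetrahedron'), (4, 'cube'), (5, 'dodecahedron')]:
    mine = math.acos(1 - 1 / (2 * math.sin(math.pi / p) ** 2))
    wiki = 2 * math.asin(math.cos(math.pi / 3) / math.sin(math.pi / p))
    print(name, round(math.degrees(mine), 2), round(math.degrees(wiki), 2))
# tetrahedron 70.53 70.53
# cube 90.0 90.0
# dodecahedron 116.57 116.57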

After calculating that angle, I just needed to place some spheres on the faces and voila:

You can also do things like a compound of 5 cubes (model):

Unfortunately, I can't actually find the ruby scripts that created the above models. They're probably on a HD somewhere...

Saturday, January 17, 2015

Binary-file Grammar

This is going to be a bit of a specific post about a filetype that only a small subset of mostly electron microscopists use. Like all file formats for important data, not being able to at least read them outside of a proprietary program is the kind of thing that keeps researchers awake at night. Also, in trying to find an elegant way to both read and write the files, I ended up designing a more general binary-file grammar and parser that (maybe) could be expanded to other filetypes.

I've actually written a few dm3 file parsers over the years; for some reason I quite enjoy reverse engineering file formats, so I haven't minded repeating myself so much. Especially as each time I think 'This'll be the time I finally find an elegant solution for doing this'. Each time I start with the best intentions - a few functions, lots of code reuse - but by the time the edge cases are sorted out, I'm left with a mess of functions that I have to teach myself again each time I see the code.

So, slightly inspired by the amazingly simple grammar parser in the udacity design of computer programs course, and slightly dissuaded by the more complete but complicated monadic parsing, I decided to give it yet another go, this time starting with an (almost) human-readable description of the file that would be the only thing the parser needed. I wanted something that served both as a source for the parser and as a concise description of how the file is actually structured. The problem is then split into two (or three) parts: 1. writing a general grammar parser, 2. writing a grammar for the specific filetype I want to read and write, and (probably) 3. converting from the syntax tree produced by the parser into a more usable format.

Grammar

The grammar consists of definitions, each one made up of other definitions or else a struct type that can be read straight from the file (it's actually slightly more complicated than this: an item can also be something that evaluates to the name of a definition, or to an array of names of definitions). Here's a simple example of the grammar:

atom: len(<l)=0
atom: len(<l), string({len}s), atom 

It describes a single element, atom, that can be stored in two different ways, both starting with the '<l' type (from python's struct module, a little-endian 32-bit int). If it's equal to zero it matches the first atom description and ends there. If non-zero it is followed by that number of characters, followed by another atom element. In both cases the length parameter is read into a variable called len, and subsequent elements get format()'ed with any preceding values. The output from the parser (or input, if writing) would be something like

{'len': 5, 'string': 'Hello', 'atom':
    {'len': 6, 'string': 'World!', 'atom':
        {'len': 0}}}
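
For comparison, here's a minimal hand-rolled sketch of what parsing this atom chain amounts to (using python's struct module directly, not the grammar parser):

import io
import struct

def read_atom(f):
    # '<l' is a little-endian 32-bit int; zero terminates the chain.
    (length,) = struct.unpack('<l', f.read(4))
    if length == 0:
        return {'len': 0}
    return {'len': length,
            'string': f.read(length).decode('ascii'),
            'atom': read_atom(f)}

data = (struct.pack('<l', 5) + b'Hello' +
        struct.pack('<l', 6) + b'World!' +
        struct.pack('<l', 0))
print(read_atom(io.BytesIO(data)))  # prints the nested dict above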

(I'm using definitions with identical names as a way to provide options to the parser. The other (probably better) way would be to use, say:

zeroatom: len(<l)=0
nonzeroatom: len(<l), string({len}s), zeroatom | nonzeroatom

Then the two types would be differentiated in the tree returned, rather than having to know the difference between the two types to know which one you have. But I'm sticking with this way as it works and makes lists of heterogeneous data slightly easier.)

There are two magic values that get inserted automatically by the parser: 'parent', which contains the values from the parent of the current item, and 'f' which gives access to the python file-object (it's currently only used as 'f.tell()' to find the current file position, so this could be replaced with just 'position' in the future). 

DM3 Grammar

The DM3 grammar I've created is a lot more complicated than the example above. A description of the filetype can be found here, and I hope that the grammar itself can provide the information almost as easily.

Here's the grammar for the DM3 files:

dm3_grammar = """
header:     version(>l)=3, len(>l), endianness(>l)=1, _pos=f.tell(), section, 
            len=f.tell()-_pos+4, zero_pad_0(>l)=0, zero_pad_1(>l)=0

section:    is_dict(b), open(b), num_tags(>l), data(["named_data"]*num_tags)

named_data: sdtype(b)=20, name_length(>H), name({name_length}s), 
             section 

named_data: sdtype(b)=20, name_length(>H), name({name_length}s), 
             dataheader 

# struct-specific data entry
dataheader: delim(4s)="%%%%", headerlen(>l), _pos=f.tell(),
            dtype(>l)=15, struct_header, 
            headerlen=(f.tell()-_pos)/4, struct_data

# array-specific data entry
dataheader: delim(4s)="%%%%", headerlen(>l), _pos=f.tell(),
            dtype(>l)=20, array_data,
            headerlen=(array_data._end-_pos)/4

# simple data
dataheader: delim(4s)="%%%%", headerlen(>l), _pos=f.tell(), 
            dtype(>l), 
            headerlen=(f.tell()-_pos)/4, data(simpledata_{dtype})

simpledata_2 = h
simpledata_3 = i
simpledata_4 = H
simpledata_5 = I
simpledata_6 = f
simpledata_7 = d
simpledata_8 = b
simpledata_9 = b
simpledata_10 = b
simpledata_11 = q
simpledata_12 = Q

#structs
struct_header: length(>l)=0, num_fields(>l),
               types(["struct_dtype"]*num_fields)

struct_data: data([("simpledata_%s" % dtypes.dtype) for dtypes in parent.struct_header.types])

struct_dtype: length(>l)=0, dtype(>l)

array_data: arraydtype(>l)=15, struct_header, len(>l),
            _end=f.tell(), array(["struct_data"]*len)

#general case:
array_data: arraydtype(>l), len(>l),
            _end=f.tell(), array("{len}"+simpledata_{arraydtype})


"""

This has a few features not described in the simple example above. Constants can be defined using name=value, as is done for the simpledata_x values. Items beginning with an underscore, '_', are treated as variables, with the value on the right-hand side being evaluated and stored in the variable. It's also possible to refer to an already defined element and either confirm the value is what is expected (when reading, or when writing with a value specified) or else have it set to the expected value (when writing with no value provided). For example, the 'header' atom has the following:

..., len(>l), endianness(>l)=1, _pos=f.tell(), section, len=f.tell()-_pos+4, ...

In this example, the first definition of len tells the parser the location and type of len; a bit later f.tell() is evaluated and stored in _pos, and later still len is checked to be equal to f.tell() minus _pos plus 4. When writing, if len is not specified, the same process takes place, only this time len is calculated from the expected value and written back into its location once it has been calculated. This allows the parser to automatically fill in expected values when writing, which makes things a lot easier.
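
The write-back is the familiar placeholder-then-seek pattern; a minimal sketch of the idea (a hypothetical helper, not the actual parser code):

import struct

def write_with_len(f, payload):
    len_pos = f.tell()
    f.write(struct.pack('>l', 0))   # placeholder for len
    pos = f.tell()                  # _pos = f.tell()
    f.write(payload)                # the section body
    end = f.tell()
    f.seek(len_pos)                 # seek back and fill in
    f.write(struct.pack('>l', end - pos + 4))  # len = f.tell() - _pos + 4
    f.seek(end)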

One thing it does that might give grammar designers nightmares is allow access to higher elements in the tree too. The struct_data item needs parent.struct_header.types to exist ('parent' is automatically set by the parser each time it parses a new item), which luckily it does, both times we need it.

Conclusion (and DM4 files)

So is it worth going to all this trouble (writing a parser, designing a grammar, writing a version of it for the file, and then having to massage the data out of a syntax tree, rather than just parsing the thing manually)? I'm not totally sure, but I do feel that if I ever need to remind myself how the file is structured, I will refer to the grammar first as a comprehensive description of the file structure.

Also, the DM3 grammar I've shown above isn't the version I'm using now. There's also a DM4 format, whose main purpose is to support 64-bit sizes. I'm actually converting a general grammar into two specific versions for DM3 and DM4 files, which means I can support both with only very minimal changes. I find and replace some fields in the general grammar as follows:

dm3_grammar_defs = { 'version': 3,
                     'len_type': '>l',
                     'len_size': 4,
                     'header_offset': 4,
                     'named_data_pre': '',
                     'named_data_post': '',
                     }

dm4_grammar_defs = { 'version': 4,
                     'len_type': '>Q',
                     'len_size': 8,
                     'header_offset': 0,
                     'named_data_pre': 'datalen(>Q), _pos=f.tell(), ',
                     'named_data_post': ', datalen=f.tell()-_pos',
                     }
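
The substitution itself can be as simple as a find-and-replace over the grammar text. A sketch, assuming the general grammar marks these fields with a placeholder syntax like %len_type% (purely illustrative - str.format would clash with the {len}-style fields the grammar already uses):

def specialise(general_grammar, defs):
    # Plain find-and-replace of each %name% placeholder.
    for name, value in defs.items():
        general_grammar = general_grammar.replace('%%%s%%' % name, str(value))
    return general_grammar

dm3_grammar = specialise(general_grammar, dm3_grammar_defs)
dm4_grammar = specialise(general_grammar, dm4_grammar_defs)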

This, I think, really shows the flexibility of using a grammar. With a general grammar and a small number of changes, it's possible to describe two different filetypes and provide reading and writing capabilities for both.

I've split this project into two projects on GitHub. There's a FileGrammar and then a separate DM3Utils project that depends upon it.

Tuesday, January 13, 2015

Somatomedin

Melanosomes, uh?

While at a Magnetic Fields concert a few years ago, one of the members of the band, as an example of just how rock'n'roll their lifestyle was, pointed out that the word exit is composed only of valid 2-letter scrabble words (ex, xi and it).

I've always wondered whether there were more words like this, mainly as a tool to help me remember the 105 two-letter scrabble words. It's quite a simple task to do in Python (the check itself is sketched after the list below), and the longest such words are denominator, kinetosomes, melanosomes and somatomedin. There are 1890 words in total (not including the two-letter words themselves), and a few more interesting ones are sinusoidal, bonefishes, monoxides and tomatoes. The only two-letter word not to appear in any longer word is uh. Trying to find the smallest number of words needed to cover all the two-letter words is slightly harder, and the best I came up with was the following 30 words.

kinetosomes
dishonored
tamoxifen
myelitides
kabalas
shadower
pagoda
manumit
boyar
rehemmed
unai
lope
ofay
befit
nexus
ohm
ahi
joe
mut
zax
pinup
baaed
myoid
byelaw
habitat
si
qi
uh
wo
gi

To be honest, it's probably easier to remember the 105.
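
The check itself is only a couple of lines; a sketch, assuming words holds the dictionary and twos the set of valid two-letter words:

def made_of_subwords(word, subwords, n=2):
    # True if every run of n adjacent letters is itself a valid word.
    return all(word[i:i + n] in subwords for i in range(len(word) - n + 1))

# e.g. 'exit' -> checks 'ex', 'xi', 'it'
hits = [w for w in words if len(w) > 2 and made_of_subwords(w, twos)]

The same function with n=3 (and a set of three-letter words) covers the three-letter variant below.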

It's also interesting to try the same thing with three-letter words. There are a lot more of them (997, with a lot less overlap - I could only reduce them to 742 longer words), but there are some interesting composite words, such as outskated, operated and monorail.

There are also 509 words that act as both, such as shadowed, honored and panamas.

The code for finding the words can be found on Github.