Saturday, August 19, 2017

Solving a Sunday Puzzle with Python and NLTK

Solving a Sunday Puzzle with Python and NLTK

NPR (the US version of Radio 4, if you like) has a regular Sunday puzzle (starring Will Shortz, the NYTimes puzzle editor).

The 2017-07-09 puzzle goes as follows:
Take a certain 7-letter word. Remove the first letter and you get a 6-letter synonym of that word. And the letter you removed is an abbreviation for the opposite of both words.
I spent quite a bit of time trying, unsuccessfully,  to come up with the answer - I kept thinking it would be a inflammable/flammable type word but the only 1-letter abbreviations I could think of were thinks like l, m, s for large, medium, small.

So I figured I'd use a word list to just look through all the seven letter words whose last six letters also made a word. I decided to use the Python natural language toolkit, or NLTK, purely as a dictionary. It turned out to be a wise choice as it wasn't as easy as I'd hoped:

from nltk.corpus import words
set(w for w in words.words() if len(w) == 7)
seven = set(w for w in words.words() if len(w) == 7)
print("Found {} 7 letter words".format(len(seven)))
six = set(w for w in words.words() if len(w) == 6)
print("Found {} 6 letter words".format(len(six)))

Which gives

Found 23870 7 letter words
Found 17708 6 letter words

So we have about 24k 7-letter words. How many have a 6-letter word as a suffix:

sub_is_six = set(w for w in seven if w[1:] in six)
print("Found {} 7 letter words where word[1:] is also a word".format(len(sub_is_six)))

Found 1621 7 letter words where word[1:] is also a word

Ok, there are 1621 7-letter words that have this property: slotter, apathic, zincite, barrack, hattery, etc, etc. I tried manually exploring the list, but it was just too many.

Luckily with NLTK, it's possible to get a list of synonyms for words using the wordnet module. Here's a function that returns a set of synonyms for a given word:

def get_synonyms(word):
    return {n for x in nltk.wordnet.wordnet.synsets(word) 
            for n in x.lemma_names()}

Now we can use this to filter our 7-letter words:

sub_is_syn = {w for w in sub_is_six if w[1:] in get_synonyms(w)}

Which gives us a list of 11 words (much more manageable): arouser, asquint, challah, crumple, factual, grumble, opossum, orotund, screaky, spatter, twinkle.

Factual is the correct answer, as it's the only one whose first letter is an abbreviation for its antonym: 'f' for false.

After doing this it's easy to extend to words with more or less letters. For example amongst 9-letter words whose 8-letter suffix is a synonym of the original we can find the python keyword enumerate, and the old word for odometer, hodometer. Interestingly, hodometer comes from the greek word hodos (ancient Greek for for path) and when combined with meta (meaning development in ancient Greek) it produces the word 'method'.

Saturday, March 11, 2017

Solving edge-effects when blurring

Solving edge-effects when blurring

One of the problems with blurring an image is what happens at the edges. For example, consider an image that has a large brightness difference between the left and right edges:

A common way to blur an image is to take advantage of the convolution theorem and turn a Gaussian blur (convolution of the source image with a Gaussian peak) into a multiplication in the Fourier domain (convert src and Gaussian peaks into their Fourier representations, multiply instead of convolve, then inverse Fourier transform). Here's some code to do just that:

import numpy as np
import cv2

def gaussian_quadrant(shape, sds):
    Return a gaussian of the given shape, with the given standard deviations.
    Treats 0,0 as origin
    return reduce(np.multiply, 
                  (np.exp(-dx**2 / (2*sd**2)) 
                   for sd, dx in zip(sds, np.indices(shape))))

def mirror_image(im):
    Takes an image A, and mirrors in x and y
    out = im
    left = np.vstack([im, im[::-1, :]])
    return np.hstack([left, left[:, ::-1]])

def gaussian_blur_cv2_dft(im, amount):
    # gaussian blur using dft
    h, w = im.shape
    quarter_gauss = gaussian_quadrant([h//2, w//2], 
                                      [h/(2*np.pi*amount), w/(2*np.pi*amount)])
    full_gauss = mirror_image(quarter_gauss)
    Fsrc = cv2.dft(im.astype(float), None, cv2.DFT_COMPLEX_OUTPUT)
    Fsrc[..., 0] *= full_gauss
    Fsrc[..., 1] *= full_gauss
    inv = cv2.dft(Fsrc, None, cv2.DFT_INVERSE | cv2.DFT_REAL_OUTPUT | cv2.DFT_SCALE)
    return inv.astype(im.dtype)

This uses the opencv dft (discrete fourier transform) method to perform the transformations to and from Fourier space (there's no need to actually Fourier transform the Gaussian peak, as the transform is just another Gaussian. We just need to work out how big the peak is in Fourier space and directly create it with that size).

If we do this with amount=50, we get the following (for the blurred images below, I've scaled by 2 to exaggerate the effects)

The problem with this is that the left edge has a bright spot that doesn't appear in the original, and the right hand edge is darker than it should be. This is because of the implicit boundary conditions of the discrete-Fourier transform. The images is treated as if it is repeated periodically:

And we can see why blurring this produces 'artefacts' on the edges.

Solution #1: Mirroring the image

One solution is to simply mirror the image, blur it, and then crop it again. The mirrored image before blurring looks like:

Then, after blurring and cropping, we get:

This looks a lot better than our first attempt. The problem is that it uses 4 times more memory. Here's the code to do this. It uses exactly the same function as before

src_h, src_w = src.shape
src_x4 = mirror_image(src)
dft_x4_blur = gaussian_blur_cv2_dft(src_x4, blur_amount)
dft_x4_blur_cropped = dft_x4_blur[:src_h, :src_w]

Solution #2: DCT

The problem with the discrete Fourier transform method is purely that its boundary conditions are perfect for repeating data, but don't work when blurring images: we'd prefer a symmetric boundary condition (effectively mirroring the image for us, like we did above). Luckily, there's a transform with exactly this property: the discrete cosine transform. Using this instead of a Fourier transform actually makes things a bit easier (the data is real all the way through). Here's the code for a dct-based blurring function:

def gaussian_blur_cv2_dct(im, amount):
    # gaussian blur using cv2 dct
    h, w = im.shape
    gaussian = gaussian_quadrant([h, w], 
                                 [h/(np.pi*amount), w/(np.pi*amount)])
    Fsrc = cv2.dct(im.astype(float))
    Fsrc *= gaussian
    inv = cv2.dct(Fsrc, None, cv2.DCT_INVERSE)
    return inv.astype(im.dtype)

Here's the results using this:

Again, this has solved the edge artifact issues, and this time we didn't need to use four times the data.


dft blur: 50ms
dft scaled then cropped: 229ms
dct blur: 87ms

I was kind of hoping that not only would the dct work better than the dft, but that it would also be significantly faster (it is limited to purely real data, and is used all the time in decoding JPEGs (and I expect movie files too) so I was hoping it would be massively optimized). But the opencv2 version runs significantly slower than the dft. It's still significantly faster than the scaling, blurring, then cropping approach though, and I expect uses a lot less memory.

The Jupyter notebook I used to perform all these experiments is on GitHub.

Sunday, March 27, 2016

(11 + 1/pi) / sqrt(2) bits in a byte

(11 + 1/pi) / sqrt(2) bits in a byte

Euler's identity has been cited as 'The most beautiful equation in mathematics', as it contains e, pi, the square root of minus 1, and 1:
Personally, I find the more general Euler's formula just a little more surprising (what are those trigonometric functions doing in there?) The identity above is simply the case when theta=pi.
Changing the subject to some simpler maths, I've been thinking about what kind of calculations can be done purely mentally. Multiplying numbers together is quite doable (if you can remember all of the terms, I find two digit numbers about my limit). A bit more interesting is estimating square roots. In some ways it's nice to know you'll never get the exact (irrational) answer, so it's really a case of how good an estimation you'll need, and it involves multiplication too anyway. The easiest way to estimate is to know the gradient of x^2 at x is 2x, and that a is the closest root of B if B - a^2 < a. Then we can start to mentally estimate, say, the square root of 7.

First step, find the closest integer root. 3^2 is 9, which is less than 3 away from our wanted root, and is therefore the closest. Then we just need to divide the difference (-2) by 2*3 (I'v emboldened the closest root to highlight when it's being used). So sqrt(7) ~= 3 - 2/(2*3) ~= 2.67. 

If we want to do the next refinement, we can multiply by 100, and say sqrt(7) = sqrt(700)/10. And we already have a starting point for 7, so we can use ten times that for 700. 26^2 is 400 + 120 + 120 + 36 = 676. This is 24 away from 700, so again 26 is the closest root (it's worth remembering the approximation we're using is always an overestimation, so when trying the next value try the lower value first - in this case we had 26 or 27 as probably the closest and 26 is actually (slightly) closer).

Now, sqrt(700) ~= 26 + 24 / 2*26 = 26 + 6/13. 6/13 is .46 giving sqrt(700) ~= 26.46 and therefore sqrt(7) ~= 2.646. Now we can cheat and check this, and we find we're good to 4 digits for approximating the real value (2.6457...).

The above isn't particularly interesting at all, but it's something to try if you're having trouble getting to sleep. But one slightly interesting case I came across was sqrt(128) (which is sqrt(2) * 8). The closest integer root is 11, and we get sqrt(128) ~= 11 + 7 / 2 * 11.  Then I remembered that 22/7 was a rough approximation for pi. In other words sqrt(128) ~= 11 + 1/pi.

So the next time you need to use some common powers of two in some code, I recommend considering the following optimisations:
These are really close to being true (8.0033... and 16.0065... respectively) and both slightly larger than the left hand side, so coercing to an integer should be safe. Adding a comment about the 'robust continued fraction expansion of expressions of the form prime + 1/x where x is transcendental' wouldn't hurt either.

Thursday, November 12, 2015

Pascal's Triangle and Fourier Transforms

One of the really useful properties of the Fourier transform is that convolution in one domain becomes multiplication in the other and vice-versa.
But it’s also interesting that the way we multiply regular base-10 numbers together is really a form of convolution. The decimal number ABC multiplied by DEF can be calculated by looking at C*(DEF) + B0*(DEF) + A00*(DEF), which is essentially convolving ADC with DEF (as this paper points out). If multiplication is a convolution, then we can convert to the Fourier domain to use multiplication to perform our convolution to perform our multiplication!
Here’s an example in python that takes two lists of digits and returns a new list representing the two numbers multiplied together:
import numpy as np
fft = np.fft.fft
ifft = np.fft.ifft

def mult_lists(a, b):
    Multiply the arrays of digits a,b by convolving a with b via a
    Fourier transform.
    A and b should have units in the first element, then base**1, base**2, etc.
    Eg 3,412 (in base 10) would be [2, 1, 4, 3]
    out_size = len(a) + len(b)
    a = a + [0] * (out_size - len(a))
    b = b + [0] * (out_size - len(b))
    f = ifft(fft(a)*fft(b)).real
    return f
To use this we can write some helper functions to convert numbers to and from lists and we then have a bigint multiplication routine:
import random

def to_digit_list(num, base=10):
    ret = []
    while num > 0:
        ret.append(num % base)
        num //= base
    return ret

def from_digit_list(nums, base=10):
    return sum(int(round(num)) * base**i for i,num in enumerate(nums))

def fourier_mult(a, b):
    return from_digit_list(mult_lists(to_digit_list(a), to_digit_list(b)))

ab = a*b
abf = fourier_mult(a, b)
if ab - abf == 0:
    print("Numbers are the same!")
There’s something else we can do with this. We can think of each row of Pascal’s Triangle as the previous row convoluted with (1, 1). This provides us with an easy way to calculate a row of Pascal’s triangle using Fourier transforms to do the convolution for us:
def pascal_row(n):
    x = [1,1] + [0] * (n-1)

    fx = fft(x)
    return np.round(ifft(fx ** n).real)
Rounding errors mean that only rows up to 50 or so get calculated accurately.
It’s interesting that Wikipedia talks about obtaining an alternating-sign row of Pascal’s triangle by taking the Fourier transform of sin(x)^n / x. I expect finding the right combinations of convolutions and multiplications might be able to convert between the two expressions.
[I spent some time trying this, but I found analytically finding something whose inverse Fourier transform was [1,1,0…] not that straight forward - a sinc function corresponding to a top-hat width of two shifted by half a pixel didn’t give me a particularly clean result and started giving errors for the very first rows of Pascal’s triangle. I put my attempt on github just in case anyone’s slightly interested…]

Sunday, April 19, 2015

Platonic Katamaris

Way back when SketchUp first came out (right after Google bought it), I was thinking of things that might be fun to model and add to the SketchUp warehouse. I really enjoyed the quick and simple way that SketchUp allowed you to make models, and I was also interested in coming up with something that could be created programmatically, as it would allow me to dip my toes into Ruby (SketchUp's script language of choice).

I'd also been playing Katamari Damacy for a while, and had been really enjoying it. It's a strange game (it's sometimes described more as a 'toy' than a 'game') and basically involves rolling things up, which makes you bigger, allowing you to roll up even bigger things, etc. Combine this with an amazing soundtrack and a quirky, almost oedipal, father-son relationship and it makes for an entertaining experience. So why not model a Katamari, a sticky ball which looks something like this:

It looks simple enough, a sphere with a few small spheres placed around it. But how do you place the small spheres regularly around a sphere? I thought about spacing them regularly on a circumference of the sphere, and maybe choosing a few other circumferences at equally spaced angles, but this doesn't really look like how they're positioned above. So a better way to regularly position the spheres is to place them on the faces of a regular polyhedron. This way the Katamari will look the same from all angles and the spheres will all be evenly spaced over the sphere. A square will only give 6 sub-spheres, which is too few, an octahedron only 8, which still seems too little, so a dodecahedron, with 12 sides, seems like a good choice (in fact in the image above I just noticed we can see 3 full spheres, and 6 half spheres, suggesting there are 12 in total).

So how to create a dodecahedron in SketchUp? A dodecahedron consists of 12 regularly placed pentagons. Creating the pentagons is easy, but how to combine 12 of them into a dodecahedron? What is the angle between two sides of a dodecahedron? Obviously the easy way to do this is to find some obscure website that could maybe have information like that, but it was annoying me slightly that I couldn't work it out from first principles. Surely there must be a way to calculate this from scratch using just basic mathematics? There is, and don't call me Surely.

Let's say we just have two pentagons, laying flat next to each other, as shown below:
To make a dodecahedron we need to rotate B around its shared edge with A, until the angle the adjacent sides make equal half the outer angle at their shared point. In the image above I've drawn this angle for the tip of A--a vertical line, by symmetry, will split the angle around the tip of A in half. So in this case we would rotate B to B', where its edge closest to the vertical line becomes vertical (imagine in this case that B' is rotated out of the screen with its tip pointing towards you):

We can find this angle using simple trigonometry. We will be at the correct angle when tan b = tan a . -cos c:

So the dihedral angle, c, is arccos (-tan b / tan a) where (in degrees, and N=5) a = 180-360/N and b=(180+360/N)/2. This can be simplified (?) to arccos(1 - 1 / (2 * sin(pi/n) ^ 2)) (and I'm sure probably further, but I stopped there).

Interestingly this function works for all platonic solids with Schläfli symbol {p, q} where q is 3 (ie tetrahedron, cube, dodecahedron). On Wikipedia, they use 2*arcsin(cos(pi/3)/sin(pi/x)) for this angle, which I'm guessing is the same thing. Here is a plot of the function from 2 to 6.

After calculating that angle, I just needed to place some spheres on the faces and voila:

You can also do things like a compound of 5 cubes (model):

Unfortunately, I can't actually find the ruby scripts that created the above models. They're probably on a HD somewhere...

Saturday, January 17, 2015

Binary-file Grammar

Binary-file Grammar

This is going to be a bit of a specific post about a filetype that only a small subset of mostly electron microscopists use. Like all file formats for important data, not being able to at least read them outside of a proprietary program is the kind of thing that keeps researchers awake at night. Also, in trying to find an elegant way to both read and write the files, I ended up designing a more general binary-file grammar and parser that (maybe) could be expanded to other filetypes.

I've actually written a few dm3 file parsers over the years, for some reason I quite enjoy reverse engineering file formats so I haven't minded repeating myself so much. Especially as each time I think 'This'll be the time I finally find an elegant solution for doing this'. Each time I start with the best intentions, a few functions, lots of code reuse, but then by the time the edge cases are sorted out, I'm left with a mess of functions that I have to teach myself again each time I see the code.

So slightly inspired by the amazingly simple grammar parser in the udacity design of computer programs course, and slightly dissuaded by the more complete but complicated monadic parsing, I decided to give it yet another go, but this time starting with a (almost) human readable description of the file that would be the only thing the parser then needed. I wanted something that served as both a source for the parser, and a concise description of how the file is actually structured. The problem is then split into two (or three) parts: 1. Writing a general grammar parser and then 2. writing a grammar for the specific filetype I want to read and write (and then probably another step 3. converting from the syntax tree produced by the parser into a more usable format).


The grammar consists of definitions with each one made up of other definitions or else a struct type that can be read straight from the file (it's actually slightly more complicated than this, and can also be something that evaluates to the name of a definition, or an array of names of definitions). Here's a simple example of the grammar:

atom: len(<l)=0
atom: len(<l), string({len}s), atom 

It describes a single element, atom, that can be stored in two different ways, both starting with the '<l' type (from python struct class, a little-endian 32-bit int). If it's equal to zero it matches the first atom description and ends there. If non-zero it is followed by that number of characters, followed by another atom element. In both cases the length parameter is read into a variable called len, and subsequent elements get format'ed with any preceding values. The output from the parser (or input if writing) would be something like

{'len': 5, 'string': 'Hello' atom: 
    {'len': 6, 'string': 'World!', atom: 
        {len: 0}}}

(I'm using definitions with identical names as a way to provide options to the parser. The other (probably better) way would be to use, say:

zeroatom: len(<l)=0
nonzeroatom: len(<l), string({len}s), zeroatom | nonzeroatom

Then the two types would be differentiated in the tree returned, rather than having to know the difference between the two types to know which one you have. But I'm sticking with this way as it works and makes lists of heterogeneous data slightly easier.)

There are two magic values that get inserted automatically by the parser: 'parent', which contains the values from the parent of the current item, and 'f' which gives access to the python file-object (it's currently only used as 'f.tell()' to find the current file position, so this could be replaced with just 'position' in the future). 

DM3 Grammar

The DM3 grammar I've created is a lot more complicated than the example above. A description of the filetype can be found here, and I hope that the grammar itself can provide the information almost as easily.

Here's the grammar for the DM3 files:

dm3_grammar = """
header:     version(>l)=3, len(>l), endianness(>l)=1, _pos=f.tell(), section, 
            len=f.tell()-_pos+4, zero_pad_0(>l)=0, zero_pad_1(>l)=0

section:    is_dict(b), open(b), num_tags(>l), data(["named_data"]*num_tags)

named_data: sdtype(b)=20, name_length(>H), name({name_length}s), 

named_data: sdtype(b)=20, name_length(>H), name({name_length}s), 

# struct-specific data entry
dataheader: delim(4s)="%%%%", headerlen(>l), _pos=f.tell(),
            dtype(>l)=15, struct_header, 
            headerlen=(f.tell()-_pos)/4, struct_data

# array-specific data entry
dataheader: delim(4s)="%%%%", headerlen(>l), _pos=f.tell(),
            dtype(>l)=20, array_data,

# simple data
dataheader: delim(4s)="%%%%", headerlen(>l), _pos=f.tell(), 
            headerlen=(f.tell()-_pos)/4, data(simpledata_{dtype})

simpledata_2 = h
simpledata_3 = i
simpledata_4 = H
simpledata_5 = I
simpledata_6 = f
simpledata_7 = d
simpledata_8 = b
simpledata_9 = b
simpledata_10 = b
simpledata_11 = q
simpledata_12 = Q

struct_header: length(>l)=0, num_fields(>l),

struct_data: data([("simpledata_%s" % dtypes.dtype) for dtypes in parent.struct_header.types])

struct_dtype: length(>l)=0, dtype(>l)

array_data: arraydtype(>l)=15, struct_header, len(>l),
            _end=f.tell(), array(["struct_data"]*len)

#general case:
array_data: arraydtype(>l), len(>l),
            _end=f.tell(), array("{len}"+simpledata_{arraydtype})


This has a few features not described in the simple example above. Constants can be defined using name=value, as is done for the simpledata_x values. Items beginning with an underscore, '_', are treated as variables, with the value on the right hand side being evaluated and stored in the variable. It's also possible to refer to an already defined element and either confirm the value is what is expected (when reading, or when writing and a value is specified) or else have it be set to the expected value (when writing and no value is provided). For example the 'header' atom has the following:

..., len(>l), endianness(>l)=1, _pos=f.tell(), section, len=f.tell()-_pos+4, ...

In this example, the first definition of len tells the parser the location and type of  len, then a bit later f.tell() is evaluated and put into _pos, and then later len is checked to be equal to f.tell() at a later point, (minus _pos plus 4). When writing, if len is not specified, the same process takes place only this time len is calculated from the expected value and written back into the location of len when it gets calculated. This allows the parser to automatically fill in values to their expected values when writing, making things a lot easier.

One thing it does which might give grammar designers nightmares, is allow access to higher elements in the tree too. The struct_data item needs parent.struct_header.types to exist ('parent' is automatically set by the parser each time it parses a new item) which it luckily is, in the two times we need it.

Conclusion (and DM4 files)

So is it worth going to all this trouble (writing a parser, and designing a grammar, then writing a version of it for the file and then having to massage the data out of a syntax-tree etc, rather than just parsing the thing manually)? I'm not totally sure, but I do feel as though if I ever need to remind myself how the file is structured I will refer to the grammar first as a comprehensive description of the file structure.

Also, the DM3 grammar I've shown above isn't the version I'm using now. There's also a DM4 format whose main purpose is to support 64 bit sizes. I'm actually converting a general grammar into two specific versions for both DM3 and DM4 files which means I can support both with only very minimal changes. I find and replace some fields in the general grammar as follows:

dm3_grammar_defs = { 'version': 3,
                     'len_type': '>l',
                     'len_size': 4,
                     'header_offset': 4,
                     'named_data_pre': '',
                     'named_data_post': '',

dm4_grammar_defs = { 'version': 4,
                     'len_type': '>Q',
                     'len_size': 8,
                     'header_offset': 0,
                     'named_data_pre': 'datalen(>Q), _pos=f.tell(), ',
                     'named_data_post': ', datalen=f.tell()-_pos',

This I think really shows the flexibility of using a grammar. With a general grammar and a small number of changes it's possible to describe two different filetypes and provide reading and writing capabilities for those files.

I've split this project into two projects on GitHub. There's a FileGrammar and then a separate DM3Utils project that depends upon it.

Tuesday, January 13, 2015


Melanosomes, uh?

While at a Magnetic Fields' concert a few years ago, one of the members of the band, as an example of just how rock'n'roll their lifestyle was, pointed out that the word exit is composed only of valid 2-letter scrabble words (ex, xi and it).

I've always been wondering if there were more words like this, mainly as a tool to help me remember the 105 two letter scrabble words. It's quite a simple task to do in Python, and the longest such words are denominator, kinetosomes, melanosomes and somatomedin. There's 1890 words in total (not including the two letter words themsleves) and a few more interesting words are sinusoidal, bonefishes, monoxides and tomatoes. The only two-letter word not to appear in any longer word is uh. Trying to find the shortest number of words needed to include all two-letter words is slightly harder, and the best I came up with was the following 30 words.


To be honest, it's probably easier to remember the 105.

It's also interesting to try the same thing with three-letter words. There's a lot more of them (997, with a lot less overlap -- I could only reduce them to 742 longer words) , but there are some interesting composite words, such as outskatedoperated and monorail.

There's also 509 words that act as both such as shadowed, honored and panamas.

The code for finding the words can be found on Github.