Thursday, March 28, 2013

Converting Grayscale to RGB with Numpy

There's a lot of scientific two-dimensional data out there, and if it's grayscale, sooner or later you need to convert it to RGB (or RGBA). I couldn't find any info about the best way to do this in numpy; a typical scenario is converting an x by y array of floats into an x by y by 3 array of 8-bit ints. Numpy offers a few ways to do it (that I could think of), and I figured testing them is the only real way to see which is fastest.
In the following, I'm assuming that any processing steps have already been taken, and all that's left is to truncate the values to unsigned 8-bit.
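
For example, if your data is still floating point in some arbitrary range, a scaling step along these lines would come first (just a sketch with a made-up helper name; the exact scaling depends on your data):

import numpy as np

def scale_to_8bit(im):
    # sketch: normalise arbitrary float data into the 0-255 range
    # before handing it to the conversion functions below
    rng = im.max() - im.min()
    if rng > 0:
        im = 255.0 * (im - im.min()) / rng
    return im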

The file containing the tests can be downloaded from github.

Method 1: Basic

The first way I tried was to simply create the output array and assign the grayscale data to each of the three channels in turn:

def to_rgb1(im):
    # I think this will be slow
    w, h = im.shape
    ret = np.empty((w, h, 3), dtype=np.uint8)
    ret[:, :, 0] = im
    ret[:, :, 1] = im
    ret[:, :, 2] = im
    return ret

This can be slightly modified (in Python, a chained assignment evaluates the rightmost expression once and assigns it to each target in turn, so the two variations below are not equivalent):

def to_rgb1a(im):
    # This should be faster than 1, as we only
    # truncate to uint8 once (?)
    w, h = im.shape
    ret = np.empty((w, h, 3), dtype=np.uint8)
    ret[:, :, 2] =  ret[:, :, 1] =  ret[:, :, 0] =  im
    return ret

def to_rgb1b(im):
    # I would expect this to be identical to 1a
    w, h = im.shape
    ret = np.empty((w, h, 3), dtype=np.uint8)
    ret[:, :, 0] = im
    ret[:, :, 1] = ret[:, :, 2] = ret[:, :, 0]
    return ret
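
A small aside on the chained assignment (this is my reading of the Python semantics, and would explain why 1 and 1a time identically below): the rightmost expression is evaluated once and then assigned to each target in turn, so in 1a the float array is still cast to uint8 once per channel, whereas 1b does a single cast and then copies uint8 data. The cast also truncates rather than rounds:

import numpy as np

# sketch: the rightmost value in a chained assignment is assigned to each target
im = np.array([[1.7, 250.2]])
ret = np.empty((1, 2, 3), dtype=np.uint8)
ret[:, :, 2] = ret[:, :, 1] = ret[:, :, 0] = im   # im itself is assigned (and cast) three times
print ret[:, :, 0]   # [[  1 250]] - fractional parts are simply dropped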

Method 2: Broadcasting


We can create the array as before and fill all three channels with a single broadcast assignment, so I'd hoped this would be faster than method 1:

def to_rgb2(im):
    # as 1, but we use broadcasting in one line
    w, h = im.shape
    ret = np.empty((w, h, 3), dtype=np.uint8)
    ret[:, :, :] = im[:, :, np.newaxis]
    return ret
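
For reference, np.newaxis just gives the image a length-one third dimension, and the broadcast assignment then stretches that axis across the three output channels (a quick sketch of the shapes involved):

import numpy as np

im = np.random.uniform(0, 255, size=(4, 5))
print im[:, :, np.newaxis].shape     # (4, 5, 1) - a view with a trailing length-1 axis
ret = np.empty((4, 5, 3), dtype=np.uint8)
ret[:, :, :] = im[:, :, np.newaxis]  # the length-1 axis broadcasts over the 3 channels
print ret.shape                      # (4, 5, 3)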

Method 3: Using dstack

The dstack function stacks arrays along the third dimension, and as the docs say, 'This is a simple way to stack 2D arrays (images) into a single 3D array for processing'.

def to_rgb3(im):
    # we can use dstack and an array copy
    # this has to be slow, we create an array with
    # 3x the data we need and truncate afterwards
    return np.asarray(np.dstack((im, im, im)), dtype=np.uint8)


def to_rgb3a(im):
    # we can use the same array 3 times, converting to
    # uint8 first
    # this explicitly converts to np.uint8 once and is short
    return np.dstack([im.astype(np.uint8)] * 3)

But there's a problem with the above: the returned array is not contiguous. This can be seen by comparing np.frombuffer(to_rgb3a(im)) with np.frombuffer(to_rgb1(im)). Since most display code needs a contiguous buffer of RGB pixels, this won't work as-is. Luckily the copy method will fix this for us, but it requires an extra step:

def to_rgb3b(im):
    # as 3a, but we add an extra copy to contiguous 'C' order
    # data
    return np.dstack([im.astype(np.uint8)] * 3).copy(order='C')
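
If you'd rather check contiguity directly than compare raw buffers, the flags attribute will tell you, and np.ascontiguousarray is an alternative to the explicit copy (a quick sketch, outside the benchmark):

import numpy as np

im = np.random.uniform(0, 255, size=(64, 64))
rgb = np.dstack([im.astype(np.uint8)] * 3)
print rgb.flags['C_CONTIGUOUS']     # is the data laid out in C order?
rgb = np.ascontiguousarray(rgb)     # forces a contiguous copy only if needed
print rgb.flags['C_CONTIGUOUS']     # True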

Method 4: Using weave

Another way to do this is to use the weave module (included with SciPy) to write C code inline, which gets compiled to native code. Since many of the numpy functions are native code anyway it's not obvious this is going to make any difference, but it's something to try.

def to_rgb4(im):
    # we use weave to do the assignment in C code
    # this only gets compiled on the first call
    import scipy.weave as weave
    w, h = im.shape
    ret = np.empty((w, h, 3), dtype=np.uint8)
    code = """
    int impos=0;
    int retpos=0;
    for(int j=0; j<Nim[1]; j++)
    {
        for (int i=0; i<Nim[0]; i++)
        {
            unsigned char d=im[impos++];
            ret[retpos++] = d;
            ret[retpos++] = d;
            ret[retpos++] = d;
        }
    }
    """
    weave.inline(code, ["im", "ret"])
    return ret

Method 5: Using repeat

This was suggested by Greger in the comments below:

def to_rgb5(im):
    # reshape (rather than resize in place) so the caller's array isn't modified
    im = im.reshape(im.shape[0], im.shape[1], 1)
    return np.repeat(im.astype(np.uint8), 3, 2)

Testing and timing

To test the above methods, I used introspection to find all functions whose names contain 'to_rgb' and timed them on float64 data returned by numpy.random.uniform for a range of image sizes, after first checking that all the functions gave identical output.

The full code is here:

import numpy as np
import time
import sys


def to_rgb1(im):
    # I think this will be slow
    w, h = im.shape
    ret = np.empty((w, h, 3), dtype=np.uint8)
    ret[:, :, 0] = im
    ret[:, :, 1] = im
    ret[:, :, 2] = im
    return ret

def to_rgb1a(im):
    # This should be faster than 1, as we only
    # truncate to uint8 once (?)
    w, h = im.shape
    ret = np.empty((w, h, 3), dtype=np.uint8)
    ret[:, :, 2] =  ret[:, :, 1] =  ret[:, :, 0] =  im
    return ret

def to_rgb1b(im):
    # I would expect this to be identical to 1a
    w, h = im.shape
    ret = np.empty((w, h, 3), dtype=np.uint8)
    ret[:, :, 0] = im
    ret[:, :, 1] = ret[:, :, 2] = ret[:, :, 0]
    return ret

def to_rgb2(im):
    # as 1, but we use broadcasting in one line
    w, h = im.shape
    ret = np.empty((w, h, 3), dtype=np.uint8)
    ret[:, :, :] = im[:, :, np.newaxis]
    return ret


def to_rgb3(im):
    # we can use dstack and an array copy
    # this has to be slow, we create an array with
    # 3x the data we need and truncate afterwards
    return np.asarray(np.dstack((im, im, im)), dtype=np.uint8)


def to_rgb3a(im):
    # we can use the same array 3 times, converting to
    # uint8 first
    # this explicitly converts to np.uint8 once and is short
    return np.dstack([im.astype(np.uint8)] * 3)


def to_rgb3b(im):
    # as 3a, but we add an extra copy to contiguous 'C' order
    # data
    return np.dstack([im.astype(np.uint8)] * 3).copy(order='C')


def to_rgb4(im):
    # we use weave to do the assignment in C code
    # this only gets compiled on the first call
    import scipy.weave as weave
    w, h = im.shape
    ret = np.empty((w, h, 3), dtype=np.uint8)
    code = """
    int impos=0;
    int retpos=0;
    for(int j=0; j<Nim[1]; j++)
    {
        for (int i=0; i<Nim[0]; i++)
        {
            unsigned char d=im[impos++];
            ret[retpos++] = d;
            ret[retpos++] = d;
            ret[retpos++] = d;
        }
    }
    """
    weave.inline(code, ["im", "ret"])
    return ret

def to_rgb5(im):
    # reshape (rather than resize in place) so the caller's array isn't modified
    im = im.reshape(im.shape[0], im.shape[1], 1)
    return np.repeat(im.astype(np.uint8), 3, 2)

funcs = dict(((x,eval(x)) for x in list(globals()) if "to_rgb" in x))
print "testing Numpy v",np.version.version
print "on Python ", sys.version
for size in [64,256,1024,2048]:
    s = np.random.uniform(256, size=(size,size))
    op = [funcs[f](s) for f in funcs]
    assert(all(map(lambda x: np.array_equal(x, op[0]), op)))
    times=min(3*10**7/size**2, 10**5) or 1
    print "\n\nFor Size",size,"\n==============="
    for name, func in sorted(funcs.items()):
        start = time.time()
        for i in range(times):
            func(s)
        end=time.time()
        print "%s took \t%0.3f ms average for %d times" % (
            name, (1000*(end-start)/times),times)

Results

The results were a little unexpected, and underline the adage that testing is the only real way to gauge performance. The following table lists the average time in ms to process a 1024x1024 image. The Windows results used a version of numpy compiled against MKL, as found here. All results were from Python 2.7.3.

Architecture                 to_rgb1  to_rgb1a  to_rgb1b  to_rgb2  to_rgb3  to_rgb3a  to_rgb3b  to_rgb4  to_rgb5
MacBookPro Win64/Python32      14.8     14.8      8.9      15.9     28.8      8.7       9.5      5.9     26.7
MacBookPro Win64/Python64       3.3      3.2      4.7       6.8     17.4      3.8       4.8      3.7     21.1
MacBookPro OSX64/Python32       2.6      2.6      3.7       9.2     20.1      2.9       3.4      1.4     19.9
MacBookPro OSX64/Python64       2.7      2.6      3.6       5.7     16.7      2.8       4.1      2.1     18.4
Xeon W3520 Win64/Python32       9.2      9.2      8.4      22.4     54.1     12.3      40.3      6.8      -
Xeon W3520 Win64/Python64       5.7      5.3      7.2      13.7     51.3     10.9      30.0      4.4      -

The first thing to notice is just how varied these different methods are. to_rgb3 is the only obviously sub-optimal implementation, and it shows, being among the slowest in every configuration. There is a surprising difference in performance between Windows 7 and OS X on the same hardware, with the Mac performing noticeably better. to_rgb3a works well under most configurations, but the overhead of the extra copy in to_rgb3b varies from relatively minor (~10% for Win64/Python32) to awful (~200% for the Xeon Win64/Python64). The Python32 results under Win64 are awful for the MacBook, but relatively better for the Xeon machine.

Methods 1 and 1a produced essentially identical timings, with 1b noticeably faster under Win64/Python32 but slower in the other configurations. Method 2 was surprisingly slow, especially considering it boils down to a single broadcast assignment inside numpy's C code.

In general the results above are just confusing, and vary significantly across platforms and architectures. Methods 1/1a give generally decent results and I can't think of a good reason not to recommend them, so I'd have to go with 1a as the most generally acceptable solution, and the one I'd tentatively recommend. The weave results are generally good, but weave does require a compiler to be installed and set up on the machine in use. Another thing to note is that casting to 8-bit doesn't clip values to the 0-255 range (out-of-range values simply wrap around), so you need to be careful to do that yourself if needed. In that case using weave to clip and assign all at once may well give the best performance.
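
To illustrate the wrapping issue, here's a quick sketch (whether you need the clip depends on your upstream processing):

import numpy as np

im = np.array([[-3.0, 100.5, 300.0]])
print im.astype(np.uint8)                     # out-of-range values wrap around (exact result is platform dependent)
print np.clip(im, 0, 255).astype(np.uint8)    # [[  0 100 255]]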

5 comments:

  1. How about using resize and repeat?

    img.resize((img.shape[0], img.shape[1], 1))
    np.repeat(img, 3, 2)

  2. Hi Greger,

    Thanks for the suggestion, I'll try that and add it to the results.

  3. I've always been more of a Python programmer than a C programmer, so forgive my naiveté in asking you why Nim[0] and Nim[1] in to_rgb4 yield the array dimensions

    Replies
    1. Weave creates special variables for the arguments if they're Numpy arrays. There's a list of them here

      I haven't used weave much but it seems as though it's going to be deprecated from SciPy and they recommend Cython instead now. I'll try and add a Cythonized comparison in the future.

  4. Thank you, I have exactly this problem and was finding method 1 dismally slow even without type casting. It's disappointing that the fancier approaches don't help, but at least I can move on.
