## Converting Grayscale to RGB with Numpy

There's a lot of scientific two-dimensional data out there, and if it's grayscale, sooner or later you need to convert it to RGB (or RGBA). I couldn't find any info about the bast way to do this in numpy, a typical scenario is converting a x by y array of floats into a x by y by 3 array of 8-bit ints. Numpy has a few ways to do it (that I could think of) and I figured testing them is the only real way to see which way's the fastest.

In the following, I'm assuming that any processing steps have already been taken, and all that's left is to truncate the values to unsigned 8-bit.

The file containing the tests can be downloaded from github.

### Method 1: Basic

The first way I tried was to simply create the image and populate it with the array three times:def to_rgb1(im): # I think this will be slow w, h = im.shape ret = np.empty((w, h, 3), dtype=np.uint8) ret[:, :, 0] = im ret[:, :, 1] = im ret[:, :, 2] = im return ret

This can be slightly modified (in Python, multiple assignment works by assigning the rightmost expression to all of the others, so the two methods here are different):

def to_rgb1a(im): # This should be fsater than 1, as we only # truncate to uint8 once (?) w, h = im.shape ret = np.empty((w, h, 3), dtype=np.uint8) ret[:, :, 2] = ret[:, :, 1] = ret[:, :, 0] = im return ret def to_rgb1b(im): # I would expect this to be identical to 1a w, h = im.shape ret = np.empty((w, h, 3), dtype=np.uint8) ret[:, :, 0] = im ret[:, :, 1] = ret[:, :, 2] = ret[:, :, 0] return ret

### Method 2: Broadcasting

We can create the array as before, and assign it in one line, so I'd hoped this would be faster than method 1:

def to_rgb2(im): # as 1, but we use broadcasting in one line w, h = im.shape ret = np.empty((w, h, 3), dtype=np.uint8) ret[:, :, :] = im[:, :, np.newaxis] return ret

### Method 3: Using dstack

The dstack method stacks arrays in the third dimesnion, and as the docs say 'This is a simple way to stack 2D arrays (images) into a single 3D array for processing'.def to_rgb3(im): # we can use dstack and an array copy # this has to be slow, we create an array with # 3x the data we need and truncate afterwards return np.asarray(np.dstack((im, im, im)), dtype=np.uint8) def to_rgb3a(im): # we can use the same array 3 times, converting to # uint8 first # this explicitly converts to np.uint8 once and is short return np.dstack([im.astype(np.uint8)] * 3)

**But,**there's a problem with the above: The returned array is not contiguous. This can be seen by comparing np.frombuffer(to_rgb3a(im)) vs np.frombuffer(to_rgb1(im)). Since most display code needs a contiguous buffer of RGB pixels this won't work. Luckily the copy function will fix this for us, but it requires an extra step:

def to_rgb3b(im): # as 3a, but we add an extra copy to contiguous 'C' order # data return np.dstack([im.astype(np.uint8)] * 3).copy(order='C')

### Method 4: Using weave

Another way to do this is to use the weave module (included with SciPy) to write C code inline, which gets compiled to native code. Since many of the numpy functions are native code anyway it's not obvious this is going to make any difference, but it's something to try.def to_rgb4(im): # we use weave to do the assignment in C code # this only gets compiled on the first call import scipy.weave as weave w, h = im.shape ret = np.empty((w, h, 3), dtype=np.uint8) code = """ int impos=0; int retpos=0; for(int j=0; j<Nim[1]; j++) { for (int i=0; i<Nim[0]; i++) { unsigned char d=im[impos++]; ret[retpos++] = d; ret[retpos++] = d; ret[retpos++] = d; } } """ weave.inline(code, ["im", "ret"]) return ret

### Method 5: Using repeat

def to_rgb5(im): im.resize((im.shape[0], im.shape[1], 1)) return np.repeat(im.astype(np.uint8), 3, 2)

### Testing and timing

To test the above methods, I used inspection to find any functions beginning 'to_rgb' and tested them on data returned from the numpy.random.uniform function (which returns float64 data) for various image sizes, after checking all functions gave identical output.The full code is here:

import numpy as np import time import sys def to_rgb1(im): # I think this will be slow w, h = im.shape ret = np.empty((w, h, 3), dtype=np.uint8) ret[:, :, 0] = im ret[:, :, 1] = im ret[:, :, 2] = im return ret def to_rgb1a(im): # This should be fsater than 1, as we only # truncate to uint8 once (?) w, h = im.shape ret = np.empty((w, h, 3), dtype=np.uint8) ret[:, :, 2] = ret[:, :, 1] = ret[:, :, 0] = im return ret def to_rgb1b(im): # I would expect this to be identical to 1a w, h = im.shape ret = np.empty((w, h, 3), dtype=np.uint8) ret[:, :, 0] = im ret[:, :, 1] = ret[:, :, 2] = ret[:, :, 0] return ret def to_rgb2(im): # as 1, but we use broadcasting in one line w, h = im.shape ret = np.empty((w, h, 3), dtype=np.uint8) ret[:, :, :] = im[:, :, np.newaxis] return ret def to_rgb3(im): # we can use dstack and an array copy # this has to be slow, we create an array with # 3x the data we need and truncate afterwards return np.asarray(np.dstack((im, im, im)), dtype=np.uint8) def to_rgb3a(im): # we can use the same array 3 times, converting to # uint8 first # this explicitly converts to np.uint8 once and is short return np.dstack([im.astype(np.uint8)] * 3) def to_rgb3b(im): # as 3a, but we add an extra copy to contiguous 'C' order # data return np.dstack([im.astype(np.uint8)] * 3).copy(order='C') def to_rgb4(im): # we use weave to do the assignment in C code # this only gets compiled on the first call import scipy.weave as weave w, h = im.shape ret = np.empty((w, h, 3), dtype=np.uint8) code = """ int impos=0; int retpos=0; for(int j=0; j<Nim[1]; j++) { for (int i=0; i<Nim[0]; i++) { unsigned char d=im[impos++]; ret[retpos++] = d; ret[retpos++] = d; ret[retpos++] = d; } } """ weave.inline(code, ["im", "ret"]) return ret def to_rgb5(im): im.resize((im.shape[0], im.shape[1], 1)) return np.repeat(im.astype(np.uint8), 3, 2) funcs = dict(((x,eval(x)) for x in list(globals()) if "to_rgb" in x)) print "testing Numpy v",np.version.version print "on Python ", sys.version for size in [64,256,1024,2048]: s = np.random.uniform(256, size=(size,size)) op = [funcs[f](s) for f in funcs] assert(all(map(lambda x: np.array_equal(x, op[0]), op))) times=min(3*10**7/size**2, 10**5) or 1 print "\n\nFor Size",size,"\n===============" for name, func in sorted(funcs.items()): start = time.time() for i in range(times): func(s) end=time.time() print "%s took \t%0.3f ms average for %d times" % ( name, (1000*(end-start)/times),times)

### Results

The results were a little bit unexpected, and highlight the idiom that testing is the only real way to gauge performance. The following table lists the average time in ms to process a 1024x1024 image. The Windows results were using a version of numpy compiled against MKL as found here. All results were from Python 2.7.3.Architecture | to_rgb1 | to_rgb1a | to_rgb1b | to_rgb2 | to_rgb3 | to_rgb3a | to_rgb3b | to_rgb4 | to_rgb5 |
---|---|---|---|---|---|---|---|---|---|

MacBookPro Win64/Python32 | 14.8 | 14.8 | 8.9 | 15.9 | 28.8 | 8.7 | 9.5 | 5.9 | 26.7 |

MacBookPro Win64/Python64 | 3.3 | 3.2 | 4.7 | 6.8 | 17.4 | 3.8 | 4.8 | 3.7 | 21.1 |

MacBookPro OSX64/Python32 | 2.6 | 2.6 | 3.7 | 9.2 | 20.1 | 2.9 | 3.4 | 1.4 | 19.9 |

MacBookPro OSX64/Python64 | 2.7 | 2.6 | 3.6 | 5.7 | 16.7 | 2.8 | 4.1 | 2.1 | 18.4 |

Xeon W3520 Win64/Python32 | 9.2 | 9.2 | 8.4 | 22.4 | 54.1 | 12.3 | 40.3 | 6.8 | |

Xeon W3520 Win64/Python64 | 5.7 | 5.3 | 7.2 | 13.7 | 51.3 | 10.9 | 30.0 | 4.4 |

The first thing to notice is just how varied these different methods are. to_rgb3 is the only obviously sub-optimal implementation, and it shows, being the slowest for any configuration. There is a surprising difference in the performance under Windows 7 vs MacOSX for the same hardware, with the Mac performing noticeably better. to_rgb3a works well under most configurations, but the overhead of the extra copy in rgb_3b varies from being relatively minor (~10% for Win64/Python32) to awful (~200% for the Xeon Win64/Python64). The Python32 results under Win64 are awful for the MacBook, but relatively better for the Xeon machine.

1 and 1a produced identical results, with 1b working noticeably better under Win64/Python32 but worse in the other configurations. Method 2 was surprisingly slow, especially considering it was one call into C code.

In general the results above are just confusing, and vary significantly on different platforms and different architectures. Methods 1/1a give generally decent results, and I can't think of a good reason not to recommend them, so I'd have to go with 1a as the least worst most generally optimal solution, and the one I'd tentatively recommend. The weave results are generally good, but they do require a compiler to be installed and set up on the machine in use. Another thing to note is that converting to 8-bit doesn't coerce values, so you need to be careful to do this yourself if needed. In this case using weave to coerce and assign all at once may well give the best performance.