Introduction

Dealing with numpy arrays that have missing data is a challenge. When calculating statistics on them, or otherwise processing their data, you need to skip the missing elements. These are often stored as some very high (or low) nodata value that will skew the result if it is included.

Using NaNs

In floating point data, there is a special number: NaN (“not a number”) that you can use to signify that a value can’t be processed for whatever reason. Numpy has a collection of functions (they all start with “nan”, such as numpy.nanmean and numpy.nanmax) that ignore any NaNs in your data. You can also test individual elements for NaN with numpy.isnan. However, what happens if you are dealing with integer data? Integer types have no NaN representation, so setting an integer element to NaN fails with a ValueError. The alternative is to convert the whole integer array to float and back again, which costs time and memory. Operations on floats can also be slower than on integers.
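
The points above can be sketched as follows. This is only an illustration: the sample values and the -99 nodata value are assumptions chosen for the example, not taken from any real dataset.

```python
import numpy

# Float data: NaN marks the missing element and the nan* functions skip it.
f = numpy.array([1.0, numpy.nan, 23.0, 78.0])
mean_ignoring_nan = numpy.nanmean(f)   # mean of 1, 23 and 78 only
nan_positions = numpy.isnan(f)         # boolean array, True at the NaN

# Integer data: there is no integer NaN, so the assignment raises.
i = numpy.array([1, -99, 23, 78])
try:
    i[1] = numpy.nan
    assign_failed = False
except ValueError:
    assign_failed = True

# The workaround: a float copy with the (assumed) nodata value -99 replaced by NaN.
as_float = numpy.where(i == -99, numpy.nan, i.astype(numpy.float64))
max_ignoring_nan = numpy.nanmax(as_float)
```

The float copy is the extra memory and conversion cost mentioned above.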

Numpy Masked Arrays

There is an alternative: numpy masked arrays. These are arrays that also store a mask recording where the data is not valid:

>>> import numpy
>>> x = numpy.array([[1, 5, 9999], [-980, 7, 9], [11, 61, 9923]])
>>> mx = numpy.ma.masked_array(x, mask=[[False, False, True], [True, False, False], [False, False, True]])
>>> mx
masked_array(
  data=[[1, 5, --],
        [--, 7, 9],
        [11, 61, --]],
  mask=[[False, False,  True],
        [ True, False, False],
        [False, False,  True]],
  fill_value=999999)

It is important to note that the mask is True where the data is masked (that is, where it should be ignored), not where it is valid. Masked arrays support many of the numpy methods:

>>> mx.max()
np.int64(61)

Note how the masked out values aren’t used in the calculations. You can also turn a masked array back into a normal array using the filled method:

>>> mx.filled(-99)
array([[  1,   5, -99],
       [-99,   7,   9],
       [ 11,  61, -99]])
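
As a further sketch of which methods respect the mask, the following rebuilds the same array as above and computes a few more statistics. The variable names are only illustrative.

```python
import numpy

x = numpy.array([[1, 5, 9999], [-980, 7, 9], [11, 61, 9923]])
mask = [[False, False, True], [True, False, False], [False, False, True]]
mx = numpy.ma.masked_array(x, mask=mask)

valid_count = mx.count()       # number of unmasked elements
total = mx.sum()               # 1 + 5 + 7 + 9 + 11 + 61
mean = mx.mean()               # total divided by valid_count
col_min = mx.min(axis=0)       # per-column minimum over unmasked values
valid_only = mx.compressed()   # 1-D normal array of just the unmasked values
```

Unlike filled, compressed drops the masked elements entirely, so the result loses the original shape.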

If your data has a single value that represents “no data” then you can pass this in to the masked_values function to create a masked array where that value is masked:

>>> x = numpy.array([1, -99, 23, 78, -99])
>>> mx = numpy.ma.masked_values(x, -99)
>>> mx
masked_array(data=[1, --, 23, 78, --],
             mask=[False,  True, False, False,  True],
       fill_value=-99)
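
When the invalid elements are defined by a condition rather than a single value, numpy.ma has related constructors. A minimal sketch, assuming the same sample data and an arbitrary threshold of 50 picked only for illustration:

```python
import numpy

x = numpy.array([1, -99, 23, 78, -99])

# Exact comparison; a natural fit for integer data.
by_equal = numpy.ma.masked_equal(x, -99)

# Mask wherever an arbitrary boolean condition holds.
by_condition = numpy.ma.masked_where(x < 0, x)

# Mask everything above a threshold (here, values greater than 50).
by_threshold = numpy.ma.masked_greater(x, 50)
```

masked_values is aimed at floating point data, where it compares with a tolerance; for integer data like this it should behave the same as masked_equal.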

Notes on using Numba with masked arrays

Numba doesn’t know anything about masked arrays - you get an Unsupported array type: numpy.ma.MaskedArray error when you pass one in to a Numba function. However, a masked array is made up of two normal arrays: the data and the mask. You can pass these in separately to a Numba function:

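import numpy
from numba import njit

@njit
def docalc(data, mask):
    # Sum only the elements whose mask entry is False (i.e. valid data)
    tot = 0
    for i in range(data.shape[0]):
        if not mask[i]:
            tot += data[i]
    return tot

x = numpy.array([1, -99, 23, 78, -99])
mx = numpy.ma.masked_values(x, -99)
# Pass the underlying data and mask arrays, not the masked array itself
result = docalc(mx.data, mx.mask)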
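
One caveat: when no element is masked, the mask attribute may be a single scalar False rather than a full array, and a function that indexes the mask element by element would not work with it. A minimal sketch of one way around this, assuming numpy.ma.getmaskarray suits your case (it always returns a full boolean array):

```python
import numpy

# No element equals -99 here, so nothing ends up masked.
x = numpy.array([1, 23, 78])
mx = numpy.ma.masked_values(x, -99)

raw_mask = mx.mask                        # scalar "no mask" value, not an array
full_mask = numpy.ma.getmaskarray(mx)     # boolean array with the shape of the data
```

Passing full_mask instead of mx.mask keeps the two arguments the same shape regardless of the input.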

However, if your data just has a single “no data” value it may be easier just to pass this value in and compare each element to it instead of using masked arrays.
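
A sketch of that simpler approach is below. The function name docalc_nodata is hypothetical, and the fallback decorator is an assumption added only so the example still runs (without compilation) where Numba is not installed.

```python
import numpy

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs (uncompiled) without Numba installed.
    def njit(func):
        return func

@njit
def docalc_nodata(data, nodata):
    # Sum every element that is not equal to the nodata value
    tot = 0
    for i in range(data.shape[0]):
        if data[i] != nodata:
            tot += data[i]
    return tot

x = numpy.array([1, -99, 23, 78, -99])
result = docalc_nodata(x, -99)
```

This avoids building a mask array at all, at the price of only supporting a single nodata value.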

Conclusion

Numpy masked arrays can be a useful tool when dealing with missing data. They provide a lighter weight alternative to converting to float arrays and setting NaN.