GPU Acceleration

The package detects available hardware at import time and dispatches compute-heavy operations to the fastest available backend.


Backend priority

CUDA (CuPy)  →  CPU (NumPy / SciPy)

Apple MPS (Metal) does not yet expose a NumPy-compatible ndimage API suitable for zoom and shift operations, so machines with only an MPS device fall back to the CPU backend.
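The priority order above can be illustrated with a small probe. This is only a sketch of one common detection pattern, not the package's actual implementation; the function name detect_backend is hypothetical, and the only assumption about CuPy is that cupy.cuda.runtime.getDeviceCount() raises or returns 0 when no usable GPU is present.

```python
def detect_backend():
    """Return 'cuda' if CuPy imports and sees at least one device, else 'cpu'.

    Hypothetical helper for illustration; the package's own detection
    logic may differ.
    """
    try:
        import cupy as cp  # optional dependency

        if cp.cuda.runtime.getDeviceCount() > 0:
            return "cuda"
    except Exception:
        # CuPy is not installed, or the CUDA driver/runtime is unusable.
        pass
    return "cpu"


print(detect_backend())
```

Catching a broad Exception here is deliberate: a missing CUDA driver surfaces as a CuPy runtime error rather than an ImportError, and either case should end in the CPU fallback.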

Checking the active backend

import multimodal_registration as mr
print(mr.backend_name())   # 'cuda' or 'cpu'

Or from the CLI:

multimodal-registration
# multimodal-registration  [backend: cuda]

Installing CuPy

Pick the wheel that matches your installed CUDA toolkit:

# CUDA 12.x
pip install "multimodal-registration[cuda12]"

# CUDA 11.x
pip install "multimodal-registration[cuda11]"

Verify that CuPy can see your GPU:

import cupy as cp
cp.zeros(1)          # raises if no GPU is found
print(cp.cuda.runtime.getDeviceCount())

What runs on GPU

Operation                     CPU                                              GPU
dct.upscale()                 scipy.ndimage.zoom                               cupyx.scipy.ndimage.zoom
dct.shift()                   scipy.ndimage.shift                              cupyx.scipy.ndimage.shift
pct.mask binary closing       scipy.ndimage.binary_closing                     cupyx.scipy.ndimage.binary_closing
register() cross-correlation  skimage.registration.phase_cross_correlation     cucim.skimage.registration.phase_cross_correlation

The deformation gradient computation (numpy.gradient, numpy.linalg.svd), map_coordinates warping, and IPF colour computation (orix) run on the CPU regardless of backend, because they are either not bottlenecks or not yet supported by CuPy/cuCIM.
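The CPU/GPU pairs in the table share the same call signatures, which is what makes per-call dispatch straightforward. The sketch below shows the idea for the zoom row only. It is an illustration under stated assumptions, not the package's code: the upscale function and the _HAS_GPU flag are hypothetical names, and the example always returns a NumPy array for simplicity.

```python
import numpy as np
import scipy.ndimage as cpu_ndimage

try:
    import cupy as cp
    import cupyx.scipy.ndimage as gpu_ndimage

    _HAS_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy missing, or no usable CUDA driver/device
    cp = None
    gpu_ndimage = None
    _HAS_GPU = False


def upscale(volume, factor, order=1):
    """Zoom `volume` by `factor` on the GPU if one is usable, else on the CPU.

    Hypothetical helper; always returns a NumPy array.
    """
    if _HAS_GPU:
        d_vol = cp.asarray(volume)                        # host -> device
        d_out = gpu_ndimage.zoom(d_vol, factor, order=order)
        return cp.asnumpy(d_out)                          # device -> host
    return cpu_ndimage.zoom(np.asarray(volume), factor, order=order)


vol = np.arange(8, dtype=np.float32).reshape(2, 2, 2)
out = upscale(vol, 3)
print(out.shape)  # (6, 6, 6)
```

Copying the result back to the host on every call keeps the example simple, but a real pipeline that chains several GPU operations would normally keep intermediate arrays on the device to avoid repeated transfers.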


Memory considerations

Upscaling a DCT volume by a factor of ~3× in each dimension increases memory by ~27×. For large datasets:

  • Process one volume at a time if GPU VRAM is limited.
  • The to_device / to_numpy helpers in backends.py move arrays explicitly, so you can stage data manually if needed:

import numpy as np
from multimodal_registration import backends

my_numpy_array = np.zeros((64, 64, 64), dtype=np.float32)   # placeholder volume

d_arr = backends.to_device(my_numpy_array)   # send to GPU
result = backends.to_numpy(d_arr)            # retrieve as a NumPy array
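
Before staging a volume, it can help to work out the ~27× figure for your own data. The following is a minimal sketch using only the standard library; upscaled_nbytes is a hypothetical helper, not part of the package, and it assumes float32 voxels (4 bytes each) unless told otherwise.

```python
import math


def upscaled_nbytes(shape, factor, itemsize=4):
    """Estimate the bytes needed for a volume after zooming every axis by `factor`."""
    new_shape = tuple(round(n * factor) for n in shape)
    return math.prod(new_shape) * itemsize


shape = (512, 512, 512)               # example volume, float32
before = math.prod(shape) * 4
after = upscaled_nbytes(shape, 3)
print(f"{before / 2**30:.2f} GiB -> {after / 2**30:.2f} GiB ({after / before:.0f}x)")
# 0.50 GiB -> 13.50 GiB (27x)
```

This counts only the output array. The input has to be resident at the same time, so the peak requirement during upscaling is at least the sum of the two.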