Overview

CuPy is a NumPy/SciPy-compatible array library for GPU-accelerated computing with Python. CuPy acts as a drop-in replacement to run existing NumPy/SciPy code on NVIDIA CUDA or AMD ROCm platforms.

CuPy provides an ndarray, sparse matrices, and the associated routines for GPU devices, all with the same API as NumPy and SciPy:

N-dimensional array (ndarray): cupy.ndarray
- Data types (dtypes): boolean (bool_), integer (int8, int16, int32, int64, uint8, uint16, uint32, uint64), float (float16, float32, float64), and complex (complex64, complex128)
- Supports semantics identical to numpy.ndarray, including basic/advanced indexing and broadcasting
Sparse matrices: cupyx.scipy.sparse
- 2-D sparse matrix: csr_matrix, coo_matrix, csc_matrix, and dia_matrix
NumPy Routines
- Module-level Functions (cupy.*)
- Linear Algebra Functions (cupy.linalg.*)
- Fast Fourier Transform (cupy.fft.*)
- Random Number Generator (cupy.random.*)
SciPy Routines
- Discrete Fourier Transforms (cupyx.scipy.fft.* and cupyx.scipy.fftpack.*)
- Advanced Linear Algebra (cupyx.scipy.linalg.*)
- Multidimensional Image Processing (cupyx.scipy.ndimage.*)
- Sparse Matrices (cupyx.scipy.sparse.*)
- Sparse Linear Algebra (cupyx.scipy.sparse.linalg.*)
- Special Functions (cupyx.scipy.special.*)
- Signal Processing (cupyx.scipy.signal.*)
- Statistical Functions (cupyx.scipy.stats.*)

Routines are backed by CUDA libraries (cuBLAS, cuFFT, cuSPARSE, cuSOLVER, cuRAND), Thrust, CUB, and cuTENSOR to provide the best performance.
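
As a minimal sketch of the drop-in usage described above, the following writes array code once against a module alias and runs it on the GPU when one is usable. The names `xp`, `use_gpu`, and `normalize_rows` are illustrative and not part of CuPy; the NumPy fallback is an assumption added only so the snippet also runs on a machine without a GPU.

```python
import numpy as np

try:
    import cupy as cp
    # Importing can succeed on a machine without a usable GPU, so probe for one.
    use_gpu = cp.cuda.is_available()
except Exception:
    use_gpu = False

# Pick the array module once; the rest of the code is written against `xp`.
xp = cp if use_gpu else np


def normalize_rows(a):
    """Scale each row to unit L2 norm (relies on broadcasting)."""
    norms = xp.linalg.norm(a, axis=1, keepdims=True)  # shape (rows, 1)
    return a / norms


a = xp.arange(1, 13, dtype=xp.float32).reshape(3, 4)
unit = normalize_rows(a)

every_other = unit[:, ::2]   # basic indexing (slicing)
large = unit[unit > 0.5]     # advanced indexing with a boolean mask

# Copy the result back to host memory when it lives on the GPU.
row_norms = xp.linalg.norm(unit, axis=1)
row_norms_host = cp.asnumpy(row_norms) if use_gpu else row_norms
print(type(unit).__module__, row_norms_host)
```

The same function body works with either module because `cupy.linalg.norm`, slicing, boolean masks, and broadcasting follow the NumPy semantics listed above.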

It is also possible to easily implement custom CUDA kernels that work with ndarray using:

Kernel Templates: Quickly define element-wise and reduction operations as a single CUDA kernel
Raw Kernel: Import existing CUDA C/C++ code
Just-in-time Transpiler (JIT): Generate CUDA kernels from Python source code
Kernel Fusion: Fuse multiple CuPy operations into a single CUDA kernel
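
To make the kernel templates above more concrete, here is a small sketch using `cupy.ElementwiseKernel` and `cupy.ReductionKernel`. The kernel names `squared_diff` and `l2norm` and the input values are illustrative choices. The NumPy branch is an assumption added so the snippet still runs without a GPU; it computes the same values but does not exercise any CUDA kernel.

```python
import numpy as np

try:
    import cupy as cp
    use_gpu = cp.cuda.is_available()
except Exception:
    use_gpu = False

x_host = np.arange(10, dtype=np.float32).reshape(2, 5)
y_host = np.arange(5, dtype=np.float32)

if use_gpu:
    # Element-wise template: the body is CUDA C applied to each element.
    squared_diff = cp.ElementwiseKernel(
        'float32 x, float32 y',   # input parameters
        'float32 z',              # output parameter
        'z = (x - y) * (x - y)',  # per-element operation
        'squared_diff')           # kernel name

    # Reduction template: map, reduce, then post-process the result.
    l2norm = cp.ReductionKernel(
        'T x',          # input parameter (generic type T)
        'T y',          # output parameter
        'x * x',        # map each element
        'a + b',        # combine two partial results
        'y = sqrt(a)',  # post-reduction map
        '0',            # identity value of the reduction
        'l2norm')       # kernel name

    x, y = cp.asarray(x_host), cp.asarray(y_host)
    result = cp.asnumpy(squared_diff(x, y))   # y is broadcast over the rows of x
    norms = cp.asnumpy(l2norm(x, axis=1))
else:
    # CPU reference with the same semantics (no custom kernel involved).
    result = (x_host - y_host) * (x_host - y_host)
    norms = np.sqrt((x_host * x_host).sum(axis=1))

print(result)
print(norms)
```

Kernel templates handle indexing and broadcasting themselves, so the kernel body only describes the work for a single element.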

CuPy can run in multi-GPU or cluster environments. The distributed communication package (cupyx.distributed) provides collective and peer-to-peer primitives for ndarray, backed by NCCL.

For users who need finer-grained control over performance, low-level CUDA features are also accessible:

Stream and Event: CUDA stream and per-thread default stream are supported by all APIs
Memory Pool: Customizable memory allocator with a built-in memory pool
Profiler: Supports profiling code using CUDA Profiler and NVTX
Host API Binding: Directly call CUDA library APIs, such as NCCL, cuDNN, cuTENSOR, and cuSPARSELt, from Python
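
The stream and memory pool features listed above might be used roughly as follows. This is a sketch under stated assumptions: the helper `sum_on_stream` and the keys of the returned dictionary are hypothetical names, and the CPU branch exists only so the snippet runs without a GPU.

```python
import numpy as np

try:
    import cupy as cp
    use_gpu = cp.cuda.is_available()
except Exception:
    use_gpu = False


def sum_on_stream(n):
    """Sum 0..n-1 on a dedicated stream; return (total, pool statistics)."""
    if not use_gpu:
        # CPU fallback so the sketch still runs without a GPU.
        return float(np.arange(n, dtype=np.float64).sum()), None

    stream = cp.cuda.Stream(non_blocking=True)
    with stream:  # work issued inside this block is enqueued on `stream`
        a = cp.arange(n, dtype=cp.float64)
        total = a.sum()
    stream.synchronize()  # wait for the enqueued work to finish

    # Inspect the built-in memory pool that served the allocations above.
    pool = cp.get_default_memory_pool()
    stats = {
        "used_bytes": pool.used_bytes(),    # bytes held by live arrays
        "total_bytes": pool.total_bytes(),  # bytes the pool has reserved
    }
    del a
    pool.free_all_blocks()  # release cached blocks that are no longer in use
    return float(total), stats


total, stats = sum_on_stream(1000)
print(total, stats)
```

The pool keeps freed blocks cached for reuse, which is why `total_bytes` can exceed `used_bytes` until `free_all_blocks()` is called.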

CuPy implements standard APIs for data exchange and interoperability, such as DLPack, CUDA Array Interface, __array_ufunc__ (NEP 13), __array_function__ (NEP 18), and the Array API Standard. Thanks to these protocols, CuPy easily integrates with NumPy, PyTorch, TensorFlow, MPI4Py, and any other libraries supporting these standards.
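
One way to see the NEP 13 and NEP 18 protocols at work is a function written purely with NumPy calls that is then handed a CuPy array. The sketch below assumes an illustrative `softmax` helper (not a CuPy or NumPy function) and falls back to a NumPy input when no GPU is usable, in which case no dispatch to CuPy takes place.

```python
import numpy as np

try:
    import cupy as cp
    use_gpu = cp.cuda.is_available()
except Exception:
    use_gpu = False


def softmax(x):
    """Row-wise softmax written only in terms of NumPy functions."""
    # np.max and np.sum dispatch through __array_function__ (NEP 18);
    # np.exp is a ufunc and dispatches through __array_ufunc__ (NEP 13).
    shifted = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)


x_host = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
x = cp.asarray(x_host) if use_gpu else x_host

probs = softmax(x)  # stays on the GPU when x is a cupy.ndarray
probs_host = cp.asnumpy(probs) if use_gpu else probs
print(type(probs).__module__, probs_host)
```

When `x` is a `cupy.ndarray`, the NumPy functions forward to their CuPy counterparts, so the result is also a `cupy.ndarray` and no implicit copy to host memory occurs.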

In an AMD ROCm environment, CuPy automatically translates all CUDA API calls to ROCm HIP (hipBLAS, hipFFT, hipSPARSE, hipRAND, hipCUB, hipThrust, RCCL, etc.), allowing code written using CuPy to run on both NVIDIA and AMD GPUs without any modification.

Project Goal

The goal of the CuPy project is to provide Python users with GPU acceleration capabilities without requiring in-depth knowledge of the underlying GPU technologies. The CuPy team focuses on providing:

Complete NumPy and SciPy API coverage to become a full drop-in replacement, as well as advanced CUDA features to maximize performance.
A mature, high-quality library that serves as a fundamental package for all projects needing acceleration, from a lab environment to a large-scale cluster.