No Python. No PyTorch dependency. Drop a single header and get a complete training pipeline — backed by a custom slab allocator and Apple AMX acceleration.
Every design decision in Sandokan traces back to a single constraint: training must be fast, deterministic, and portable — without dragging in a Python runtime.
All gradient buffers are served from a pre-allocated contiguous slab. Zero malloc / free during training. No heap fragmentation over long runs.
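The mechanics of a bump-style slab can be sketched in a few lines. This is an illustrative toy, not Sandokan's actual allocator (names like `GradSlab` are hypothetical): one contiguous buffer is sized up front, sub-buffers are carved out with a bump pointer, and "freeing" between runs is a single offset reset.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of a slab allocator: one contiguous buffer sized
// up front; gradient buffers are carved out with a bump pointer, so no
// malloc/free happens inside the training loop.
class GradSlab {
public:
    explicit GradSlab(std::size_t bytes) : storage_(bytes), offset_(0) {}

    // Hand out a 64-byte-aligned sub-buffer; never touches the heap
    // after construction. Returns nullptr when the slab is exhausted.
    float* alloc_floats(std::size_t count) {
        offset_ = (offset_ + 63) & ~std::size_t{63};  // SIMD-friendly alignment
        std::size_t bytes = count * sizeof(float);
        if (offset_ + bytes > storage_.size()) return nullptr;
        float* p = reinterpret_cast<float*>(storage_.data() + offset_);
        offset_ += bytes;
        return p;
    }

    // Reclaiming the whole slab is an O(1) offset reset, not a free:
    // no fragmentation can accumulate over long runs.
    void reset() { offset_ = 0; }

private:
    std::vector<std::uint8_t> storage_;
    std::size_t offset_;
};
```

Because the topology is fixed before training starts, the slab can be sized exactly once and every step reuses the same addresses.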
Batched GEMM via Apple Accelerate and AMX co-processors. Combined with the slab allocator, this is the engine's primary performance lever.
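Why batching is the primary lever: a batch of B forward passes through one layer collapses into a single matrix-matrix product, which BLAS/AMX executes far more efficiently than B matrix-vector products. A dependency-free sketch of that reshaping (illustrative only; Sandokan dispatches the real call to Accelerate):

```cpp
#include <array>

// Naive reference GEMM: C (MxN) = A (MxK) * B (KxN), row-major.
// Stacking a batch of N input column vectors into B turns N separate
// matrix-vector products into this one call -- the shape that
// Accelerate/AMX accelerates.
template <int M, int K, int N>
std::array<float, M * N> gemm(const std::array<float, M * K>& a,
                              const std::array<float, K * N>& b) {
    std::array<float, M * N> c{};
    for (int i = 0; i < M; ++i)
        for (int k = 0; k < K; ++k)
            for (int j = 0; j < N; ++j)
                c[i * N + j] += a[i * K + k] * b[k * N + j];
    return c;
}
```

With `EIGEN_USE_BLAS` defined (see the install steps below), Eigen routes products of this shape to Accelerate's `sgemm` instead of its own kernels.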
Compose typed submodules with Submodule<T>. Auto-registers with the parent on construction — you cannot forget a register call.
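The registration trick can be sketched roughly as follows. This is a minimal reconstruction, not Sandokan's actual internals: the parent keeps a child list, and `Submodule<T>`'s constructor takes the parent as its first argument and appends to that list before the field is ever usable.

```cpp
#include <utility>
#include <vector>

// Minimal stand-ins for the library types (hypothetical internals).
struct Module {
    std::vector<Module*> children;  // filled at construction time
};

template <typename T>
struct Submodule {
    T inner;
    // First argument is the owning parent; the rest forward to T's
    // constructor. Registration happens here, so it cannot be forgotten.
    template <typename... Args>
    Submodule(Module& parent, Args&&... args)
        : inner(std::forward<Args>(args)...) {
        parent.children.push_back(&inner);
    }
    T* operator->() { return &inner; }
};

struct Linear : Module {
    int in, out;
    Linear(int in_, int out_) : in(in_), out(out_) {}
};

// Declaring the member registers it -- no separate register() call.
struct Net : Module {
    Submodule<Linear> fc { *this, 64, 32 };
};
```

Because member initializers run during the parent's construction, the full child list exists by the time the constructor body finishes, which is what lets the slab allocator size everything before training.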
ImageDataset pages images on demand — RSS stays bounded regardless of dataset size. TabularDataset handles numeric CSVs with column-major storage.
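The paging idea behind bounded RSS can be sketched like this. The class below is a hypothetical illustration, not `ImageDataset`'s real API: samples are decoded only when indexed, and a fixed-capacity cache evicts the oldest entry, so resident memory stays constant however large the dataset is.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical demand-paging dataset: decode on first access, keep at
// most `cache_cap` decoded samples resident.
class PagedDataset {
public:
    explicit PagedDataset(std::vector<std::string> paths,
                          std::size_t cache_cap = 256)
        : paths_(std::move(paths)), cap_(cache_cap) {}

    const std::vector<float>& operator[](std::size_t i) {
        auto it = cache_.find(i);
        if (it != cache_.end()) return it->second;   // already resident
        if (order_.size() == cap_) {                 // evict oldest: bounds RSS
            cache_.erase(order_.front());
            order_.pop_front();
        }
        order_.push_back(i);
        return cache_[i] = decode(paths_[i]);        // page in on demand
    }

private:
    std::vector<float> decode(const std::string&) {
        // Stand-in for real image decoding.
        return std::vector<float>(28 * 28, 0.f);
    }
    std::vector<std::string> paths_;
    std::size_t cap_;
    std::deque<std::size_t> order_;
    std::unordered_map<std::size_t, std::vector<float>> cache_;
};
```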
SGD, Adam, and LinearLR schedulers out of the box. Training loops handle shuffling, partial-batch skipping, and scheduler stepping automatically.
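The `LinearLR(optim, 150, 1e-5f)` call in the example below suggests a decay from the optimizer's initial rate to a final rate over a fixed number of epochs. A plausible sketch of that schedule (the formula is an assumption, not taken from Sandokan's source):

```cpp
// Linear interpolation from lr0 at epoch 0 to lr_final at the last
// epoch -- the shape a LinearLR-style scheduler would step through.
float linear_lr(float lr0, float lr_final, int epoch, int total_epochs) {
    float t = static_cast<float>(epoch) / static_cast<float>(total_epochs);
    return lr0 + (lr_final - lr0) * t;
}
```

The training loop steps the scheduler once per epoch automatically, so user code never calls this directly.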
Compact binary model files with a 4-word header, optional normalisation block, and DFS-traversal weight layout. Load in one call.
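Assuming "word" means a 32-bit value, the 4-word header could look like the struct below. The field meanings (magic, version, layer count, flags) are guesses for illustration; only the 4-word size comes from the description above.

```cpp
#include <cstdint>

// Hypothetical layout of the 4-word model file header. Weights follow
// in depth-first traversal order of the module tree, so loading is a
// single sequential read.
struct ModelHeader {
    std::uint32_t magic;    // file identifier
    std::uint32_t version;  // format revision
    std::uint32_t layers;   // number of weight tensors that follow
    std::uint32_t flags;    // e.g. whether the optional normalisation block is present
};
static_assert(sizeof(ModelHeader) == 16, "4 x 32-bit words");
```

A DFS weight layout means the file order matches construction order, so the same auto-registered module tree that sized the slab can also drive deserialization without any name lookup.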
Networks are plain C++ structs inheriting from Module. Submodules auto-register — the topology is known at construction time, so the slab allocator can compute sizes before any data moves.
```cpp
struct ResBlock : Module {
    Submodule<Linear> fc1 { *this, 64, 64 };
    ReLU relu1;
    Submodule<Linear> fc2 { *this, 64, 64 };
    ReLU relu2;

    MatrixXf forward(const MatrixXf& x) override {
        return relu2.forward(
                   fc2.forward(
                       relu1.forward(fc1.forward(x)))) + x;  // residual skip
    }
};
```
```cpp
// One call allocates the entire gradient slab
LetterNet net;
init_pmad_for(net);

Adam optim(1e-3f);
LinearLR sched(optim, 150, 1e-5f);
train_module(net, sched, train, test, 150, 128);
```

Benchmark setup: Apple Silicon (M-series) · EIGEN_USE_BLAS · architecture 784→64→64→26 · batch = 128 · EMNIST Letters, 124 800 training samples.
| Backend | Total (ms) | ms / epoch | ms / sample | Samples / sec |
|---|---|---|---|---|
| Eigen single-sample | 9 257 | 1 851 | 0.0148 | 67 408 |
| Sandokan single-sample | 7 540 | 1 508 | 0.0121 | 82 757 |
| Eigen batched | 614 | 123 | 0.0010 | 1 015 951 |
| Sandokan batched + parallel | 386 | 77 | 0.0006 | 1 615 666 |
Sandokan's batched parallel path is the fastest configuration at 1.6 M samples/sec: 19.5× faster than single-sample Sandokan and 1.6× faster than plain Eigen batched. On Fashion-MNIST, the same path reaches 1.74 M samples/sec at 34.4 ms/epoch.
The quickest path. Installs Sandokan and its Eigen dependency.
Use find_package and link the sandokan::sandokan target. That's it.
One CMake flag unlocks Apple Accelerate / AMX for a significant speed boost on Apple Silicon.
#include <sandokan.h> and you have the full training API.
```sh
# 1. Install
brew install sandokan

# 2. Build your project
cmake -B build .
cmake --build build -j
```

```cmake
# CMakeLists.txt
find_package(sandokan REQUIRED)
target_link_libraries(your_target
    PRIVATE sandokan::sandokan)

# 3. Enable AMX (Apple Silicon)
target_compile_definitions(sandokan
    INTERFACE EIGEN_USE_BLAS)
target_link_libraries(sandokan
    INTERFACE "-framework Accelerate")
```

```cpp
// 4. In your code
#include <sandokan.h>
```