No Python. No PyTorch dependency. Drop a single header and get a complete training pipeline — backed by a custom slab allocator and Apple AMX acceleration.
Every design decision in Sandokan traces back to a single constraint: training must be fast, deterministic, and portable — without dragging in a Python runtime.
All gradient buffers are served from a pre-allocated contiguous slab. Zero malloc / free during training. No heap fragmentation over long runs.
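The mechanics of a bump-style slab can be sketched in a few lines. This is an illustrative toy, not Sandokan's actual allocator (names like `GradSlab` are hypothetical): one contiguous buffer is sized up front, sub-buffers are carved out with a bump pointer, and "freeing" between runs is a single offset reset.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of a slab allocator: one contiguous buffer sized
// up front; gradient buffers are carved out with a bump pointer, so no
// malloc/free happens inside the training loop.
class GradSlab {
public:
    explicit GradSlab(std::size_t bytes) : storage_(bytes), offset_(0) {}

    // Hand out a 64-byte-aligned sub-buffer; never touches the heap
    // after construction. Returns nullptr when the slab is exhausted.
    float* alloc_floats(std::size_t count) {
        offset_ = (offset_ + 63) & ~std::size_t{63};  // SIMD-friendly alignment
        std::size_t bytes = count * sizeof(float);
        if (offset_ + bytes > storage_.size()) return nullptr;
        float* p = reinterpret_cast<float*>(storage_.data() + offset_);
        offset_ += bytes;
        return p;
    }

    // Reclaiming the whole slab is an O(1) offset reset, not a free:
    // no fragmentation can accumulate over long runs.
    void reset() { offset_ = 0; }

private:
    std::vector<std::uint8_t> storage_;
    std::size_t offset_;
};
```

Because the topology is fixed before training starts, the slab can be sized exactly once and every step reuses the same addresses.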
Batched GEMM via Apple Accelerate and AMX co-processors. Combined with the slab allocator, this is the engine's primary performance lever.
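Why batching is the primary lever: a batch of B forward passes through one layer collapses into a single matrix-matrix product, which BLAS/AMX executes far more efficiently than B matrix-vector products. A dependency-free sketch of that reshaping (illustrative only; Sandokan dispatches the real call to Accelerate):

```cpp
#include <array>

// Naive reference GEMM: C (MxN) = A (MxK) * B (KxN), row-major.
// Stacking a batch of N input column vectors into B turns N separate
// matrix-vector products into this one call -- the shape that
// Accelerate/AMX accelerates.
template <int M, int K, int N>
std::array<float, M * N> gemm(const std::array<float, M * K>& a,
                              const std::array<float, K * N>& b) {
    std::array<float, M * N> c{};
    for (int i = 0; i < M; ++i)
        for (int k = 0; k < K; ++k)
            for (int j = 0; j < N; ++j)
                c[i * N + j] += a[i * K + k] * b[k * N + j];
    return c;
}
```

With `EIGEN_USE_BLAS` defined (see the install steps below), Eigen routes products of this shape to Accelerate's `sgemm` instead of its own kernels.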
Compose typed submodules with Submodule<T>. Auto-registers with the parent on construction — you cannot forget a register call.
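The registration trick can be sketched roughly as follows. This is a minimal reconstruction, not Sandokan's actual internals: the parent keeps a child list, and `Submodule<T>`'s constructor takes the parent as its first argument and appends to that list before the field is ever usable.

```cpp
#include <utility>
#include <vector>

// Minimal stand-ins for the library types (hypothetical internals).
struct Module {
    std::vector<Module*> children;  // filled at construction time
};

template <typename T>
struct Submodule {
    T inner;
    // First argument is the owning parent; the rest forward to T's
    // constructor. Registration happens here, so it cannot be forgotten.
    template <typename... Args>
    Submodule(Module& parent, Args&&... args)
        : inner(std::forward<Args>(args)...) {
        parent.children.push_back(&inner);
    }
    T* operator->() { return &inner; }
};

struct Linear : Module {
    int in, out;
    Linear(int in_, int out_) : in(in_), out(out_) {}
};

// Declaring the member registers it -- no separate register() call.
struct Net : Module {
    Submodule<Linear> fc { *this, 64, 32 };
};
```

Because member initializers run during the parent's construction, the full child list exists by the time the constructor body finishes, which is what lets the slab allocator size everything before training.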
ImageDataset pages images on demand — RSS stays bounded regardless of dataset size. TabularDataset handles numeric CSVs with column-major storage.
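The paging idea behind bounded RSS can be sketched like this. The class below is a hypothetical illustration, not `ImageDataset`'s real API: samples are decoded only when indexed, and a fixed-capacity cache evicts the oldest entry, so resident memory stays constant however large the dataset is.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical demand-paging dataset: decode on first access, keep at
// most `cache_cap` decoded samples resident.
class PagedDataset {
public:
    explicit PagedDataset(std::vector<std::string> paths,
                          std::size_t cache_cap = 256)
        : paths_(std::move(paths)), cap_(cache_cap) {}

    const std::vector<float>& operator[](std::size_t i) {
        auto it = cache_.find(i);
        if (it != cache_.end()) return it->second;   // already resident
        if (order_.size() == cap_) {                 // evict oldest: bounds RSS
            cache_.erase(order_.front());
            order_.pop_front();
        }
        order_.push_back(i);
        return cache_[i] = decode(paths_[i]);        // page in on demand
    }

private:
    std::vector<float> decode(const std::string&) {
        // Stand-in for real image decoding.
        return std::vector<float>(28 * 28, 0.f);
    }
    std::vector<std::string> paths_;
    std::size_t cap_;
    std::deque<std::size_t> order_;
    std::unordered_map<std::size_t, std::vector<float>> cache_;
};
```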
SGD, Adam, and LinearLR schedulers out of the box. Training loops handle shuffling, partial-batch skipping, and scheduler stepping automatically.
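The `LinearLR(optim, 150, 1e-5f)` call in the example below suggests a decay from the optimizer's initial rate to a final rate over a fixed number of epochs. A plausible sketch of that schedule (the formula is an assumption, not taken from Sandokan's source):

```cpp
// Linear interpolation from lr0 at epoch 0 to lr_final at the last
// epoch -- the shape a LinearLR-style scheduler would step through.
float linear_lr(float lr0, float lr_final, int epoch, int total_epochs) {
    float t = static_cast<float>(epoch) / static_cast<float>(total_epochs);
    return lr0 + (lr_final - lr0) * t;
}
```

The training loop steps the scheduler once per epoch automatically, so user code never calls this directly.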
Compact binary model files with a 4-word header, optional normalisation block, and DFS-traversal weight layout. Load in one call.
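Assuming "word" means a 32-bit value, the 4-word header could look like the struct below. The field meanings (magic, version, layer count, flags) are guesses for illustration; only the 4-word size comes from the description above.

```cpp
#include <cstdint>

// Hypothetical layout of the 4-word model file header. Weights follow
// in depth-first traversal order of the module tree, so loading is a
// single sequential read.
struct ModelHeader {
    std::uint32_t magic;    // file identifier
    std::uint32_t version;  // format revision
    std::uint32_t layers;   // number of weight tensors that follow
    std::uint32_t flags;    // e.g. whether the optional normalisation block is present
};
static_assert(sizeof(ModelHeader) == 16, "4 x 32-bit words");
```

A DFS weight layout means the file order matches construction order, so the same auto-registered module tree that sized the slab can also drive deserialization without any name lookup.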
Networks are plain C++ structs inheriting from Module. Submodules auto-register — the topology is known at construction time, so the slab allocator can compute sizes before any data moves.
```cpp
struct ResBlock : Module {
    Submodule<Linear> fc1 { *this, 64, 64 };
    ReLU relu1;
    Submodule<Linear> fc2 { *this, 64, 64 };
    ReLU relu2;

    MatrixXf forward(const MatrixXf& x) override {
        return relu2.forward(
                   fc2.forward(
                       relu1.forward(fc1.forward(x)))) + x;  // residual skip
    }
};
```
```cpp
// One call allocates the entire gradient slab
LetterNet net;
init_pmad_for(net);

Adam optim(1e-3f);
LinearLR sched(optim, 150, 1e-5f);
train_module(net, sched, train, test, 150, 128);
```

Benchmark setup: Apple Silicon (M-series) · EIGEN_USE_BLAS · architecture 784→64→64→26 · batch = 128 · EMNIST Letters, 124 800 training samples.
| Backend | Total (ms) | ms / epoch | ms / sample | Samples / sec |
|---|---|---|---|---|
| Eigen single-sample | 9 257 | 1 851 | 0.0148 | 67 408 |
| Sandokan single-sample | 7 540 | 1 508 | 0.0121 | 82 757 |
| Eigen batched | 614 | 123 | 0.0010 | 1 015 951 |
| Sandokan batched + parallel | 386 | 77 | 0.0006 | 1 615 666 |
Sandokan's batched parallel path is the fastest configuration at 1.6 M samples/sec: 19.5× faster than single-sample Sandokan and 1.6× faster than plain Eigen batched. On Fashion-MNIST, the same path reaches 1.74 M samples/sec at 34.4 ms/epoch.
The quickest path. Installs Sandokan and its Eigen dependency.
Use find_package and link the sandokan::sandokan target. That's it.
One CMake flag unlocks Apple Accelerate / AMX for a significant speed boost on Apple Silicon.
#include <sandokan.h> and you have the full training API.
```sh
# 1. Install
brew install sandokan

# 2. Build your project
cmake -B build .
cmake --build build -j
```

```cmake
# CMakeLists.txt
find_package(sandokan REQUIRED)
target_link_libraries(your_target
    PRIVATE sandokan::sandokan)

# 3. Enable AMX (Apple Silicon)
target_compile_definitions(sandokan
    INTERFACE EIGEN_USE_BLAS)
target_link_libraries(sandokan
    INTERFACE "-framework Accelerate")
```

```cpp
// 4. In your code
#include <sandokan.h>
```