Benchmarks
The primary advantage of fastrad over traditional CPU-bound libraries is performance. The repository includes a benchmarking suite, `run_benchmark.py`, that measures computation time and checks numerical stability.
fastrad substantially accelerates the construction of expensive texture matrices (e.g., GLCM, GLRLM, GLDM), particularly on NVIDIA GPUs.
IBSI Compliance and Numerical Parity
fastrad implements 116 mathematical features strictly conforming to the Image Biomarker Standardisation Initiative (IBSI) guidelines.

| Feature Class | Mean Abs Diff | Max Abs Diff | Features Within 1e-4 | Features Outside 1e-4 |
|---|---|---|---|---|
| firstorder | 3.20e-16 | 4.44e-15 | 16 | 0 |
| shape | 9.93e-15 | 1.14e-13 | 14 | 0 |
| glcm | 7.12e-13 | 1.09e-11 | 24 | 0 |
| glrlm | 2.05e-15 | 1.42e-14 | 16 | 0 |
| glszm | 2.66e-15 | 2.49e-14 | 16 | 0 |
| gldm | 3.05e-15 | 3.91e-14 | 14 | 0 |
| ngtdm | 2.26e-17 | 8.33e-17 | 5 | 0 |

Outlier Analysis: Every feature in every class falls within the 1e-4 parity tolerance, giving 100% compliance on the Phase 1 digital phantom.
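The columns of the parity table can be reproduced from two lists of feature values. The sketch below is illustrative only: `parity_summary` is a hypothetical helper, not part of fastrad, and it assumes the 1e-4 tolerance is applied to the absolute difference, as the column names suggest.

```python
def parity_summary(reference, candidate, tol=1e-4):
    """Summarise absolute differences between two equally long lists of feature values."""
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    within = sum(d <= tol for d in diffs)  # features inside the tolerance
    return {
        "mean_abs_diff": sum(diffs) / len(diffs),
        "max_abs_diff": max(diffs),
        "within": within,
        "outside": len(diffs) - within,
    }


# Made-up values for illustration; the third feature is deliberately off by 1e-3.
summary = parity_summary([1.0, 2.0, 3.0], [1.0, 2.00005, 3.001])
print(summary["within"], summary["outside"])
```

With these made-up inputs, two features fall within the tolerance and one falls outside it.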
GPU Performance
The table below compares a standard PyRadiomics extraction against fastrad running on an NVIDIA GPU (test system: Intel Core i9 14th Gen, RTX 4070 Ti, 96 GB RAM).
(Timings were measured on a real clinical TCIA segmentation mask, comparing native single-threaded PyRadiomics feature calculation against `fastrad` with PyTorch CUDA offloading.)

| Feature Class | PyRadiomics (CPU) | fastrad (CUDA) | Speedup |
|---|---|---|---|
| Firstorder | 0.4079s | 0.0083s | 49.3x |
| Shape | 0.4114s | 0.0117s | 35.0x |
| GLCM | 0.4175s | 0.0210s | 19.9x |
| GLRLM | 0.4135s | 0.0320s | 12.9x |
| GLSZM | 0.4129s | 0.0183s | 22.5x |
| GLDM | 0.4209s | 0.0113s | 37.1x |
| NGTDM | 0.4119s | 0.0130s | 31.6x |
| TOTAL | 2.8961s | 0.1157s | ~25.0x |

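For readers who want to take comparable timings themselves, here is a minimal timing sketch. It is not the logic of `run_benchmark.py`, which is not shown here; `median_runtime` and `speedup` are hypothetical helpers. When timing CUDA code, keep in mind that kernels launch asynchronously, so the timed callable should end with `torch.cuda.synchronize()` before the clock is read.

```python
import statistics
import time


def median_runtime(fn, repeats=5, warmup=1):
    """Median wall-clock time of fn() in seconds, after a few untimed warm-up calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_s, candidate_s):
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


# Totals from the table above: 2.8961 s (PyRadiomics) vs 0.1157 s (fastrad).
print(f"{speedup(2.8961, 0.1157):.1f}x")  # prints "25.0x"
```

Using the median instead of the mean keeps a single slow run (for example, a first call that pays one-off initialisation costs) from skewing the result.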
CPU Performance
Even without a dedicated GPU, fastrad implements matrix shifts and crops as tensor operations, which makes it faster than single-threaded PyRadiomics.

| Hardware Architecture | PyRadiomics (1 thread) | fastrad (1 thread) |
|---|---|---|
| Apple M3 (ARM) | 2.99s | 0.78s (3.8x) |
| Intel Core i9 14th Gen (x86) | 2.89s | 1.10s (2.6x) |

Multi-threading Fairness Benchmark
PyRadiomics CPU (Single Thread): 2.89s
PyRadiomics CPU (32 Threads): 2.88s
fastrad CPU (Single Thread): 1.10s
=> Comparative Advantage (fastrad 1t vs PyRadiomics 32t): 2.63x speedup
Note: PyRadiomics is not internally parallelised at the feature computation level; threading only affects SimpleITK image operations. This explains the observed lack of scaling.
ROI Size Scaling Benchmark (GPU)

| Radius (mm) | Voxel Count | PyRadiomics Total (s) | fastrad GPU Total (s) | Speedup |
|---|---|---|---|---|
| 5 | 199 | 2.908 | 0.112 | 25.8x |
| 10 | 2249 | 2.936 | 0.138 | 21.2x |
| 15 | 8263 | 2.941 | 0.155 | 18.9x |
| 20 | 20181 | 3.031 | 0.198 | 15.2x |
| 25 | 38327 | 3.111 | 0.251 | 12.3x |
| 30 | 67461 | 3.271 | 0.337 | 9.6x |

Optimizing GLSZM (cuCIM)
Building the Gray Level Size Zone Matrix (GLSZM) relies heavily on connected-component labeling, whose union-find algorithms often suffer from atomic contention on GPU hardware.
fastrad therefore uses a hybrid strategy that depends on the target device. On CPU targets, it applies a bounding-box pre-crop via `scipy.ndimage` to isolate gray-level structures before connected-component labeling, which keeps CPU runtimes well below those of traditional scalar baselines. On a CUDA target, it attempts to route GLSZM generation through RAPIDS cuCIM (`cucim.core.operations.morphology.label`) instead, avoiding the tensor-loop bottleneck.
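The CPU idea of cropping to a bounding box before labeling can be sketched with `scipy.ndimage`. This is an illustration under stated assumptions, not fastrad's actual implementation: `glszm_sketch` is a hypothetical name, gray levels are assumed to be already discretised to integers 1..`n_levels`, and zones are assumed to use full connectivity (8 neighbours in 2D, 26 in 3D).

```python
import numpy as np
from scipy import ndimage


def glszm_sketch(image, mask, n_levels):
    """Count zones per (gray level, zone size); levels are assumed to be 1..n_levels."""
    mask = mask.astype(bool)
    # Full connectivity: 8 neighbours in 2D, 26 in 3D (an assumption of this sketch).
    structure = ndimage.generate_binary_structure(image.ndim, image.ndim)

    zones = []  # (gray level, zone size) pairs
    for level in range(1, n_levels + 1):
        level_mask = mask & (image == level)
        if not level_mask.any():
            continue
        # Pre-crop to the bounding box of this gray level so that labeling
        # only touches the region that can contain its zones.
        bbox = ndimage.find_objects(level_mask.astype(np.int32))[0]
        labels, _ = ndimage.label(level_mask[bbox], structure=structure)
        sizes = np.bincount(labels.ravel())[1:]  # index 0 is background
        zones.extend((level, int(size)) for size in sizes)

    max_size = max((size for _, size in zones), default=1)
    matrix = np.zeros((n_levels, max_size), dtype=np.int64)
    for level, size in zones:
        matrix[level - 1, size - 1] += 1
    return matrix


image = np.array([[1, 1, 2],
                  [1, 2, 2],
                  [3, 3, 1]])
print(glszm_sketch(image, np.ones_like(image), n_levels=3))
```

Rows correspond to gray levels and columns to zone sizes. For this toy image, level 1 forms one zone of size 3 and one of size 1, level 2 forms one zone of size 3, and level 3 forms one zone of size 2.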
Stability Guarantee
fastrad includes a feature reproducibility and stability analysis that uses the RIDER Lung CT scan-rescan pairs to compute Intraclass Correlation Coefficients (ICC), alongside input perturbations (translation and Gaussian noise).
ICC Analysis on Real RIDER Scan-Rescan Pairs
fastrad features with ICC >= 0.90: 10.7%
PyRadiomics features with ICC >= 0.90: 8.7%
fastrad mean ICC: 0.3619
PyRadiomics mean ICC: 0.3530
Wilcoxon signed-rank test: stat=647.0000, p=0.4109 (no statistically significant difference between the two libraries at the 0.05 level)
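As a reminder of what a per-feature ICC measures, the sketch below computes a one-way random-effects ICC, often written ICC(1,1), from scan-rescan value pairs. The text above does not state which ICC form the analysis uses, so this choice is an assumption, and `icc_oneway` is a hypothetical helper.

```python
def icc_oneway(pairs):
    """One-way random-effects ICC(1,1) for a list of (scan, rescan) feature values."""
    n, k = len(pairs), 2  # n subjects, k = 2 measurements each
    grand_mean = sum(a + b for a, b in pairs) / (n * k)
    subject_means = [(a + b) / k for a, b in pairs]
    # Between-subject and within-subject mean squares.
    msb = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    msw = sum((a - m) ** 2 + (b - m) ** 2
              for (a, b), m in zip(pairs, subject_means)) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


# Made-up scan/rescan values for one feature across three patients.
print(icc_oneway([(1.0, 1.2), (2.0, 1.9), (3.0, 3.1)]))
```

An ICC close to 1 means that differences between patients dominate the scan-rescan variation; identical scan and rescan values give exactly 1.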
Numerical Robustness to Input Perturbation

| Perturbation | PyRadiomics Mean Drift | fastrad Mean Drift | Failure Count |
|---|---|---|---|
| Gaussian Noise | 10.58% | 10.20% | 0 |
| Translation | 228.77% | 219.83% | 0 |

Memory Footprint Optimization
fastrad trades peak CPU memory for execution speed: it materializes full dense tensors throughout the computation instead of looping over voxels sequentially. At an ROI radius of 30 mm (about 67k voxels), fastrad needs substantially more peak CPU RAM (~7.6 GB) than PyRadiomics (<1 GB). On systems with limited memory, or when processing very large whole-organ segmentation volumes, use the GPU pathway or smaller batch chunks.
GPU VRAM Profile (Full Pipeline)

| Feature Class | Peak VRAM Allocated (MB) |
|---|---|
| firstorder | 116.47 |
| shape | 627.48 |
| glcm | 356.58 |
| glrlm | 263.34 |
| glszm | 116.47 |
| gldm | 361.76 |
| ngtdm | 654.78 |
| FULL PIPELINE | 654.78 |

Edge Case Handling

| Edge Case | Expected Behaviour | fastrad Behaviour | PyRadiomics Behaviour |
|---|---|---|---|
| Empty Mask | ValueError | ValueError | ValueError |
| Single-voxel ROI | Exception / Graceful | Graceful Completion | ValueError |
| Very Small ROI (<8 voxels) | Exception / Graceful | Graceful Completion | Graceful Completion |
| Non-isotropic Spacing | UserWarning | Graceful + Warning | Graceful Completion |

Dense Voxel-Wise Hardware Extraction Performance
This section measures how fastrad's runtime scales when features are evaluated over dense sliding windows across a large clinical volume, producing multi-channel spatial PyTorch tensor maps instead of single scalar values.
Hardware Evaluation Config: CPU
Volume Profile: 64x64x64 Matrix (262,144 physical spatial locations)

| Kernel Size (Voxels) | Stride | Output Map Shape | Window Evaluations Executed | Execution Time (s) |
|---|---|---|---|---|
| 32^3 | 16 | [3x3x3] | 27 | 0.78 |
| 24^3 | 8 | [6x6x6] | 216 | 3.16 |
| 16^3 | 4 | [13x13x13] | 2197 | 15.71 |

Note: Output feature maps cover valid patches only. `DenseFeatureExtractor` applies no padding, so out-of-bounds values never contaminate the results, and it extracts deterministic, memory-strided patch views via PyTorch `F.unfold`.
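The output map shapes and window counts in the table follow from the usual "valid" sliding-window arithmetic, `(size - kernel) // stride + 1` per axis. The helpers below are a hypothetical illustration of that arithmetic and are not part of the fastrad API.

```python
def windows_per_axis(size, kernel, stride):
    """Number of valid (unpadded) window positions along one axis."""
    return (size - kernel) // stride + 1


def dense_output_shape(volume_shape, kernel, stride):
    """Shape of the feature map produced by a cubic kernel with the given stride."""
    return tuple(windows_per_axis(s, kernel, stride) for s in volume_shape)


for kernel, stride in [(32, 16), (24, 8), (16, 4)]:
    shape = dense_output_shape((64, 64, 64), kernel, stride)
    n_windows = shape[0] * shape[1] * shape[2]
    print(kernel, stride, shape, n_windows)
```

For the 64x64x64 volume this gives 27, 216, and 2197 windows, matching the table. The measured time grows much more slowly than the window count (81 times more windows, roughly 20 times more time), since each window shrinks as the kernel gets smaller.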
For the full set of automated metrics across the combined validation setup, see the report generated by `run_benchmark.py`.