Benchmarks

The primary advantage of using fastrad over traditional CPU-bound libraries is performance. The repository contains a rigorous benchmarking suite run_benchmark.py that validates both computation time and numeric stability.

fastrad accelerates intensive texture matrix constructions (e.g., GLCM, GLRLM, GLDM) dramatically, scaling exceptionally well on NVIDIA GPUs.

IBSI Compliance and Numerical Parity

fastrad implements 116 mathematical features strictly conforming to the Image Biomarker Standardisation Initiative (IBSI) guidelines.

Numerical Parity with PyRadiomics (TCIA Clinical Image)
Feature Class	Mean Abs Diff	Max Abs Diff	Features Within 1e-4
firstorder	3.20e-16	4.44e-15	16
shape	9.93e-15	1.14e-13	14
glcm	7.12e-13	1.09e-11	24
glrlm	2.05e-15	1.42e-14	16
glszm	2.66e-15	2.49e-14	16
gldm	3.05e-15	3.91e-14	14
ngtdm	2.26e-17	8.33e-17	5

Outlier Analysis: All features across all classes are strictly within the designated 1e-4 parity tolerance. 100% compliant with the Phase 1 digital phantom.

GPU Performance

Below is a direct measurement comparing a standard PyRadiomics extraction against fastrad utilizing an NVIDIA GPU (Intel i9 14th Gen, RTX 4070 Ti, 96GB RAM).

(Metrics taken from a real clinical TCIA segmentation mask, comparing native single-threaded PyRadiomics feature calculation against `fastrad` PyTorch CUDA offloading)

GPU Runtime Acceleration (TCIA Clinical Mask)
Feature Class	PyRadiomics (CPU)	fastrad (CUDA)	Speedup
Firstorder	0.4079s	0.0083s	49.3x
Shape	0.4114s	0.0117s	35.0x
GLCM	0.4175s	0.0210s	19.9x
GLRLM	0.4135s	0.0320s	12.9x
GLSZM	0.4129s	0.0183s	22.5x
GLDM	0.4209s	0.0113s	37.1x
NGTDM	0.4119s	0.0130s	31.6x
TOTAL	2.8961s	0.1157s	~25.0x

CPU Performance

Even without a dedicated GPU, fastrad natively optimizes matrix shifts and crops via tensor algorithms to cleanly exceed typical single-threaded PyRadiomics evaluation instances.

CPU Runtime Acceleration (TCIA Clinical Mask)
Hardware Architecture	PyRadiomics (1t)	fastrad (1t)
Apple M3 (ARM)	2.99s	0.78s (3.8x)
Intel Core i9 14th Gen (x86)	2.89s	1.10s (2.6x)

Multi-threading Fairness Benchmark

PyRadiomics CPU (Single Thread): 2.89s
PyRadiomics CPU (32 Threads): 2.88s
fastrad CPU (Single Thread): 1.10s

=> Comparative Advantage (fastrad 1t vs PyRadiomics 32t): 2.63x speedup

Note: PyRadiomics is not internally parallelised at the feature computation level; threading only affects SimpleITK image operations. This explains the observed lack of scaling.

ROI Size Scaling Benchmark (GPU)

GPU Scaling Speedup
Radius (mm)	Voxel Count	PyRadiomics Total (s)	fastrad GPU Total (s)	Speedup
5	199	2.908s	0.112s	25.8x
10	2249	2.936s	0.138s	21.2x
15	8263	2.941s	0.155s	18.9x
20	20181	3.031s	0.198s	15.2x
25	38327	3.111s	0.251s	12.3x
30	67461	3.271s	0.337s	9.6x

Optimizing GLSZM (cuCIM)

By default, the Gray Level Size Zone Matrix (GLSZM) relies heavily on Connected-Component union-find algorithms that often struggle with atomic contention on GPU hardware.

fastrad is specifically architected with a hybrid bypass framework that evaluates your hardware configuration. On CPU targets, fastrad utilizes a highly efficient bounding-box pre-crop strategy via scipy.ndimage to isolate gray level structures prior to connected-components labeling, resulting in extreme CPU processing speeds that comfortably outclass traditional scalar baselines. If the pipeline detects a CUDA target, it will attempt to route the GLSZM generation uniquely through RAPIDS cuCIM (cucim.core.operations.morphology.label) for true heterogeneous acceleration, cleanly circumventing the tensor-loop bottleneck.

Stability Guarantee

fastrad includes a rigorous feature reproducibility and stability analysis utilizing the RIDER Lung CT scan-rescan pairs to compute Intraclass Correlation Coefficients (ICC) alongside physical tensor perturbations (Translation and Gaussian Noise).

ICC Analysis on Real RIDER Scan-Rescan Pairs

Fastrad Features with ICC >= 0.90: 10.7%
PyRadiomics Features with ICC >= 0.90: 8.7%
Fastrad Mean ICC: 0.3619
PyRadiomics Mean ICC: 0.3530
Wilcoxon signed-rank test: stat=647.0000, p=0.4109

Numerical Robustness to Input Perturbation

Numerical Robustness
Perturbation	PyRadiomics Mean Drift	fastrad Mean Drift	Failure Count
Gaussian Noise	10.58%	10.20%	0
Translation	228.77%	219.83%	0

Memory Footprint Optimization

Because of its dense tensor streaming architecture prioritizing evaluation speed, fastrad fundamentally trades peak CPU memory footprint for significant reductions in execution time. By materializing full dense tensors throughout computation instead of sequential voxel loops, at an ROI diameter of 30mm (67k voxels), fastrad requires substantially more peak CPU RAM (~7.6GB) compared to PyRadiomics (<1GB). It is highly recommended to use the GPU pathway or smaller batch chunks on systems with limited physical resources or when processing massive whole-organ segmentation volumes.

GPU VRAM Profile (Full Pipeline)

Peak GPU VRAM
Feature Class	Peak VRAM Allocated (MB)
firstorder	116.47
shape	627.48
glcm	356.58
glrlm	263.34
glszm	116.47
gldm	361.76
ngtdm	654.78
FULL PIPELINE	654.78

Edge Case Handling

Handling of Edge Cases
Edge Case	Expected Behaviour	fastrad Behaviour	PyRadiomics Behaviour
Empty Mask	ValueError	ValueError	ValueError
Single-voxel ROI	Exception / Graceful	Graceful Completion	ValueError
Very Small ROI (<8 voxels)	Exception / Graceful	Graceful Completion	Graceful Completion
Non-isotropic Spacing	UserWarning	Graceful + Warning	Graceful Completion

Dense Voxel-Wise Hardware Extraction Performance

This section evaluates the runtime extraction performance scaling of fastrad when evaluating sliding windows densely across a large clinical tissue volume block, producing explicit multi-channel natively tracked spatial PyTorch Tensor maps instead of single scalar representations.

Hardware Evaluation Config: CPU

Volume Profile: 64x64x64 Matrix (262,144 physical spatial locations)

Voxel-Wise Feature Map Generation Runtime
Kernel Size (Voxel)	Stride	Result Shape Map	Window Evaluations Executed	Execution Time (s)
32^3	16	[3x3x3]	27	0.78s
24^3	8	[6x6x6]	216	3.16s
16^3	4	[13x13x13]	2197	15.71s

Note: Output feature maps are evaluated strictly to valid mathematical patches exclusively. `DenseFeatureExtractor` prevents padding out of bound math pollution, executing highly deterministic memory-strided patch views via `PyTorch F.unfold` framework logic.

For full rigorous automated metrics across our combined validation setup, refer to the generated scientific report utilizing the run_benchmark.py scripts.