DGX Spark LLM Benchmark Results

System: NVIDIA DGX Spark (ProMax GB10)

GPU: NVIDIA GB10 (Grace Blackwell, Compute Capability 12.1)

Memory: 119.64 GB Unified (Shared CPU+GPU)

Architecture: ARM Cortex (X925 + A725, 20 cores)

Test Runs: 60 total (3 models × 2 environments × 10 iterations)
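
The run count follows directly from the test matrix. As a minimal sketch (not the harness used for these results), the matrix can be enumerated as below. The model labels are assumptions: the results name Qwen2.5-72B and a 7B DeepSeek model, and the third model is not named here, so a placeholder stands in for it.

```python
from itertools import product

# Hypothetical labels. Only Qwen2.5-72B and a 7B DeepSeek model are named in
# the results; the third entry is a placeholder, not a real model name.
MODELS = ["qwen2.5-72b", "deepseek-7b", "model-3-placeholder"]
ENVIRONMENTS = ["native", "docker"]
ITERATIONS = range(1, 11)  # 10 iterations per model/environment pair


def build_run_matrix():
    """Return one (model, environment, iteration) tuple per benchmark run."""
    return list(product(MODELS, ENVIRONMENTS, ITERATIONS))


runs = build_run_matrix()
print(len(runs))  # 3 models * 2 environments * 10 iterations = 60
```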

🔍 Key Findings

Memory Overhead: Docker containers consume 20-31 GB more memory than native execution. On the GB10's unified memory architecture, Docker's cgroup accounting double-counts GPU allocations.
KV Cache Impact: Native execution provides 1.6-2.7x more KV cache (17-28 GB more), enabling better throughput scaling and longer context handling.
Performance Parity: Both environments achieve near-identical throughput (~119-120 tokens/sec), with standard deviation < 0.6 tokens/sec. Neither environment carries a throughput penalty.
Model Efficiency: Qwen2.5-72B shows the best memory efficiency (70.03 GB peak, 44.71 GB KV cache), using less memory than the 7B DeepSeek model despite being 10x larger.
Consistency: All 60 runs showed excellent thermal management (59-61°C) and performance stability (σ < 0.6 tokens/sec).
Root Cause: On the GB10 (Grace Blackwell), the CPU and GPU share one memory pool, so Docker's cgroups count GPU allocations a second time as container RAM. This overhead is not expected on discrete GPU systems (H100/A100), where GPU memory is separate from host RAM.
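
The statistics behind the findings above (mean throughput, standard deviation, KV cache ratio and difference) are simple to reproduce from raw run data. The sketch below shows one way to compute them. The function names, the 1.0 tokens/sec parity tolerance, and the sample values are all illustrative assumptions, not the measured data or the original analysis script.

```python
from statistics import mean, stdev


def summarize_throughput(samples):
    """Return (mean, sample standard deviation) for tokens/sec samples."""
    return mean(samples), stdev(samples)


def kv_cache_advantage(native_gb, docker_gb):
    """Return (ratio, difference in GB) of native KV cache over Docker."""
    return native_gb / docker_gb, native_gb - docker_gb


def is_parity(native_samples, docker_samples, tolerance=1.0):
    """True when mean throughputs differ by less than `tolerance` tokens/sec.

    The default tolerance is an assumption chosen for illustration.
    """
    return abs(mean(native_samples) - mean(docker_samples)) < tolerance


# Made-up placeholder samples, NOT the benchmark measurements.
native = [119.6, 120.1, 119.9]
docker = [119.8, 119.5, 120.2]

print(summarize_throughput(native))
print(summarize_throughput(docker))
print(is_parity(native, docker))
print(kv_cache_advantage(40.0, 20.0))  # placeholder sizes in GB
```

Substituting the per-run tokens/sec values and the KV cache sizes reported by the inference server for each environment would reproduce the figures quoted in the findings.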

For Large Models (>10B): Use native/chroot execution on DGX Spark for optimal memory efficiency. The 20-31 GB savings and 1.6-2.7x KV cache advantage are significant for production workloads.
For Small Models (<10B): Docker is acceptable if deployment convenience outweighs ~30 GB overhead.
For Discrete GPUs: This finding is specific to the GB10's unified memory (Grace Blackwell). Traditional discrete GPU systems (H100, A100) should not exhibit this pattern.
Phase 2 Investigation: Test cgroup-level solutions, systemd-nspawn alternatives, and compare with discrete GPU systems.
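
A possible starting point for the cgroup-level investigation is to read what the container's cgroup is actually being charged for while a model is loaded. The sketch below assumes cgroup v2, where `memory.current` and `memory.stat` are exposed under `/sys/fs/cgroup` inside the container (cgroup v1 uses a different layout, with `memory.usage_in_bytes` under `/sys/fs/cgroup/memory/`). The function names are hypothetical, and the conversion to GiB is an assumption, since the results do not state whether "GB" means 10^9 or 2^30 bytes.

```python
from pathlib import Path

GIB = 1024 ** 3


def read_memory_current(cgroup_dir="/sys/fs/cgroup"):
    """Return total bytes charged to the cgroup (cgroup v2 memory.current)."""
    return int((Path(cgroup_dir) / "memory.current").read_text().strip())


def read_memory_stat(cgroup_dir="/sys/fs/cgroup"):
    """Parse cgroup v2 memory.stat into a {counter_name: value} dict."""
    stats = {}
    for line in (Path(cgroup_dir) / "memory.stat").read_text().splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats


def report(cgroup_dir="/sys/fs/cgroup"):
    """Print the total charge plus the main memory.stat categories."""
    total = read_memory_current(cgroup_dir)
    stats = read_memory_stat(cgroup_dir)
    print(f"memory.current: {total / GIB:.2f} GiB")
    for key in ("anon", "file", "shmem"):
        if key in stats:
            print(f"  {key}: {stats[key] / GIB:.2f} GiB")


if __name__ == "__main__":
    # The root cgroup on the host has no memory.current, so check first.
    if (Path("/sys/fs/cgroup") / "memory.current").exists():
        report()
    else:
        print("memory.current not found; run inside a cgroup v2 container")
```

Running this inside the container before and after model load, and comparing the delta against the GPU memory the inference process reports, would show which counters absorb the extra 20-31 GB.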