Slurm Factory Architecture
This document provides a comprehensive overview of slurm-factoryβs relocatable build architecture, optimization strategies, and implementation details focused on performance and portability.
Overview
slurm-factory is designed around relocatable package generation using an optimized build pipeline. The architecture prioritizes build speed through intelligent caching, dependency classification, and container reuse while producing portable packages that work across diverse HPC environments.
Core Architecture Flow
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Optimized Build Pipeline β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β Base Image β LXD Copy β Cache Mount β Spack Bootstrap β
β (Cached) β (Fast) β (Persistent)β (Cached) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β Dependency Classification & Build β
β External Tools β Runtime Libraries Fresh β
β (System Pkgs) β (Arch-Optimized) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β Relocatable Package Assembly β
β Software TAR β Module TAR β Dynamic Prefix β
β (2-25GB) β (4KB) β (Runtime Override) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Build Optimization Strategy
1. Container Efficiency
- Base Image Reuse: Single Ubuntu 24.04 base container across all builds
- LXD Copy Operations: Fast container duplication instead of fresh installs
- Persistent Mounts: Cache directories mounted across container lifecycles
2. Multi-Layer Caching
Build Cache (Binary Packages)
βββ spack-buildcache/ # Compiled package binaries
βββ spack-sourcecache/ # Downloaded source archives
βββ binary_index/ # Dependency resolution cache
βββ ccache/ # Compiler object cache
Performance Impact:
- First Build: 45-90 minutes (full compilation)
- Subsequent Builds: 5-15 minutes (>10x speedup)
- Cache Hit Ratio: 80-95% for common dependencies
3. Dependency Classification Strategy
slurm-factory uses intelligent dependency classification to optimize both build time and package size:
π§ External Tools (Build-Time Only)
System packages used during compilation but not included in final package:
externals:
cmake: {spec: "cmake@3.28.3", prefix: "/usr"}
autoconf: {spec: "autoconf@2.71", prefix: "/usr"}
automake: {spec: "automake@1.16.5", prefix: "/usr"}
gcc: {spec: "gcc@13.3.0", prefix: "/usr"}
pkg-config: {spec: "pkg-config@0.29.2", prefix: "/usr"}
Benefits:
- Fast Builds: Leverage optimized system packages
- Reduced Compilation: No need to rebuild standard tools
- Smaller Packages: Build tools excluded from final package
β‘ Runtime Libraries (Fresh Builds)
Critical dependencies compiled with architecture-specific optimizations:
buildable: true
packages:
munge: # Authentication daemon (security critical)
json-c: # JSON parsing (runtime linked)
curl: # HTTP client for REST API
openssl: # SSL/TLS encryption (runtime linked)
hwloc: # Hardware topology detection
readline: # Interactive CLI support
lz4: # Compression (runtime linked)
Benefits:
- Performance: Architecture-specific optimizations (SSE, AVX)
- Security: Fresh builds with latest patches
- Compatibility: Guaranteed version compatibility with Slurm
4. Compilation Acceleration
spack_config:
build_jobs: 4 # Parallel compilation
ccache: true # Compiler object caching
build_stage: "/tmp/spack-stage" # Fast tmpfs builds
performance_optimizations:
parallel_builds: 4 # Concurrent package builds
compiler_cache: enabled # C/C++ object caching
hardlink_views: true # Fast file operations
unify_concretizer: true # Optimal dependency resolution
Impact:
- Parallel Compilation: 4x faster builds on multi-core systems
- ccache: 50-80% reduction in C/C++ compilation time
- Hardlinks: 10x faster than symlink-based package views
Relocatable Module Architecture
Dynamic Prefix Implementation
slurm-factory generates relocatable Lmod modules using environment variable substitution:
-- Dynamic path resolution with fallback
prepend_path("PATH", "${SLURM_INSTALL_PREFIX:-{prefix}}/bin")
prepend_path("LD_LIBRARY_PATH", "${SLURM_INSTALL_PREFIX:-{prefix}}/lib")
-- Runtime configuration
setenv("SLURM_ROOT", "${SLURM_INSTALL_PREFIX:-{prefix}}")
setenv("SLURM_PREFIX", "${SLURM_INSTALL_PREFIX:-{prefix}}")
-- Build metadata for introspection
setenv("SLURM_BUILD_TYPE", "standard build")
setenv("SLURM_VERSION", "25-05-1-1")
Module Context System
module_context:
installation_root: "{prefix}"
redistributable: true
supports_relocation: true
relocation_variable: "SLURM_INSTALL_PREFIX"
Deployment Flexibility:
# Default deployment
module load slurm/25.05
# Custom path deployment
export SLURM_INSTALL_PREFIX=/shared/apps/slurm
module load slurm/25.05
# Container deployment
export SLURM_INSTALL_PREFIX=/app/slurm
module load slurm/25.05
Package Structure & Optimization
Software Package Architecture
slurm-25.05-software.tar.gz (~2-25GB)
βββ software/
β βββ bin/ # Slurm executables
β β βββ srun, sbatch, squeue
β β βββ sinfo, scancel, scontrol
β β βββ salloc, sacct, sreport
β βββ sbin/ # Slurm daemons
β β βββ slurmd, slurmctld
β β βββ slurmdbd, slurmrestd
β β βββ slurm-wlm-configurator
β βββ lib/ # Runtime libraries
β β βββ libslurm.so* # Core Slurm library
β β βββ slurm/ # Plugin libraries
β β β βββ accounting_storage_*.so
β β β βββ job_submit_*.so
β β β βββ select_*.so
β β βββ runtime dependencies
β β βββ libmunge.so*
β β βββ libjson-c.so*
β β βββ libcurl.so*
β β βββ libssl.so*
β βββ include/ # Development headers
β β βββ slurm/
β βββ share/ # Documentation & configs
β βββ man/
β βββ doc/
Module Package Structure
slurm-25.05-module.tar.gz (~4KB)
βββ modules/
β βββ slurm/
β βββ 25.05.lua # Relocatable Lmod module
βββ modulefiles/ # Alternative hierarchy
βββ slurm/
βββ 25.05
Module Features:
- Relocatable Paths: Dynamic prefix support
- Build Metadata: Version, type, capabilities
- Conflict Resolution: Prevents multiple Slurm versions
- Help Text: Usage instructions and customization hints
Manages application settings and cache directory structure.
Settings Class:
@dataclass
class Settings:
project_name: str # LXD project name
@property
def home_cache_dir(self) -> Path: # ~/.slurm-factory/
@property
def builds_dir(self) -> Path: # ~/.slurm-factory/builds/
@property
def spack_buildcache_dir(self) -> Path: # ~/.slurm-factory/spack-buildcache/
@property
def spack_sourcecache_dir(self) -> Path: # ~/.slurm-factory/spack-sourcecache/
Directory Structure:
~/.slurm-factory/
βββ builds/ # Built packages output
βββ spack-buildcache/ # Spack binary cache
βββ spack-sourcecache/ # Spack source cache
4. Constants and Configuration (constants.py
)
Defines all build constants, paths, and script templates.
Supported Slurm Versions:
SLURM_VERSIONS = {
"25.05": "25-05-1-1",
"24.11": "24-11-6-1",
"23.11": "23-11-11-1",
"23.02": "23-02-7-1",
}
Container Paths:
CONTAINER_CACHE_DIR = "/opt/slurm-factory-cache"
CONTAINER_PATCHES_DIR = "/srv/global-patches"
CONTAINER_SPACK_PROJECT_DIR = "/root/spack-project"
CONTAINER_SLURM_DIR = "/opt/slurm"
CONTAINER_BUILD_OUTPUT_DIR = "/opt/slurm-factory-cache/builds"
Script Templates:
- Cache setup scripts
- Patch installation scripts
- Spack environment setup
- Build and package creation scripts
5. Spack Configuration (spack_yaml.py
)
Handles Spack environment configuration and YAML generation.
Key Functions:
- generate_yaml_string(): Creates Spack environment YAML
- Package specification and variant management
- Compiler configuration
- Build optimization settings
Container Architecture
LXD Integration
slurm-factory uses the craft-providers
library for LXD container management.
Container Lifecycle:
Base Image (ubuntu:24.04) β Base Instance Creation β
Build Instance Launch β Setup & Build β Package Extraction β Cleanup
Base Instance Strategy:
- Reusable base instances with common dependencies
- 90-day expiry for automatic cleanup
- Cached Spack installations for faster builds
Build Instance Isolation:
- Unique instances per build to prevent conflicts
- Automatic cleanup after successful builds
- Resource limits and security constraints
Container Environment
Image Configuration:
- Base Image: Ubuntu 24.04 LTS
- Remote: ubuntu (official images)
- Architecture: x86_64 (primary), ARM64 (experimental)
Mounted Directories:
Host Container
~/.slurm-factory/ β /opt/slurm-factory-cache/
Container Setup Process:
- Cloud-init configuration and boot
- Package manager updates
- Development tools installation
- Spack bootstrap and setup
- Cache directory configuration
- Patch file installation
Build Process Architecture
Phase 1: Environment Preparation
Cache Setup:
# Create cache directories with proper permissions
mkdir -p /opt/slurm-factory-cache/{spack-buildcache,spack-sourcecache,builds}
chmod -R 755 /opt/slurm-factory-cache/
Patch Installation:
# Install Slurm package patches
mkdir -p /srv/global-patches/
# Copy slurm_prefix.patch and package.py
Phase 2: Spack Bootstrap
Spack Installation:
# Clone and configure Spack
git clone -b v1.0.0 https://github.com/spack/spack.git /opt/spack
source /opt/spack/share/spack/setup-env.sh
Environment Setup:
# spack.yaml configuration
spack:
specs:
- slurm@{version} ^openmpi ^pmix
concretizer:
unify: true
config:
build_stage: /tmp/spack-stage
Phase 3: Slurm Compilation
Build Process:
# Create Spack environment
spack env create slurm-{version}
spack env activate slurm-{version}
# Install dependencies and Slurm
spack install slurm@{version}
# Create build cache
spack buildcache create --unsigned slurm
Optimization Features:
- Architecture-specific optimizations
- Parallel compilation
- Dependency caching
- Build cache reuse
Phase 4: Package Creation
Software Package:
# Create software tarball
spack location -i slurm | tar -czf slurm-{version}-software.tar.gz
Module Files:
# Generate Environment Modules
spack module tcl refresh
tar -czf slurm-{version}-modules.tar.gz modules/
Data Flow Architecture
Build Data Flow
graph TD
A[CLI Command] --> B[Settings Configuration]
B --> C[LXD Provider Setup]
C --> D[Container Launch]
D --> E[Environment Setup]
E --> F[Spack Bootstrap]
F --> G[Patch Installation]
G --> H[Package Build]
H --> I[Cache Creation]
I --> J[Package Assembly]
J --> K[Output Extraction]
K --> L[Container Cleanup]
Cache Data Flow
graph LR
A[Host Cache] --> B[Container Mount]
B --> C[Spack Build Cache]
C --> D[Source Cache]
D --> E[Build Artifacts]
E --> F[Output Packages]
F --> A
Storage Architecture
Host Storage Layout
~/.slurm-factory/
βββ builds/ # Final build outputs
β βββ 25.05/
β β βββ slurm-25.05-software.tar.gz
β β βββ slurm-25.05-modules.tar.gz
β βββ 24.11/
βββ spack-buildcache/ # Binary package cache
β βββ build_cache/
β βββ _pgp/
βββ spack-sourcecache/ # Source code cache
βββ archive/
βββ git/
Container Storage Layout
Container Filesystem:
βββ /opt/spack/ # Spack installation
βββ /opt/slurm-factory-cache/ # Mounted host cache
βββ /srv/global-patches/ # Patch files
βββ /root/spack-project/ # Build environment
βββ /opt/slurm/ # Slurm installation
βββ /srv/build-output/ # Package staging
Performance Architecture
Build Optimization
Parallel Processing:
- Multi-core compilation via Spack
- Parallel dependency builds
- Container resource allocation
- I/O optimization with NVMe
Caching Strategy:
- Source Cache: Downloaded source archives
- Build Cache: Compiled binary packages
- Container Cache: Reusable base instances
- Incremental Builds: Dependency reuse
Resource Management
Memory Optimization:
- Container memory limits
- Spack memory usage controls
- Garbage collection strategies
- Build staging optimization
Storage Optimization:
- Compressed package formats
- Cache size management
- Automatic cleanup policies
- Efficient file transfer
Security Architecture
Container Security
Isolation Boundaries:
- Process isolation via LXD
- Filesystem namespace isolation
- Network namespace isolation
- User namespace mapping
Security Controls:
- Resource limits and quotas
- AppArmor/SELinux profiles
- Capability restrictions
- Read-only system mounts
Build Security
Source Verification:
- Checksum validation for downloads
- Trusted package repositories
- Reproducible build environments
- Build artifact integrity
Access Control:
- User-specific cache directories
- Container process ownership
- Secure temporary directories
- Permission management
Scalability Architecture
Horizontal Scaling
Multi-Host Deployment:
- Distributed LXD clusters
- Shared storage backends
- Build farm coordination
- Load balancing strategies
Parallel Builds:
- Concurrent build instances
- Resource pool management
- Build queue management
- Result aggregation
Vertical Scaling
Resource Scaling:
- Dynamic CPU allocation
- Memory scaling policies
- Storage expansion
- Network bandwidth management
Integration Architecture
CI/CD Integration
Pipeline Integration:
# GitHub Actions example
- name: Build Slurm Package
run: |
uv run slurm-factory build --slurm-version 25.05
# Upload artifacts
Automation Points:
- Triggered builds on version updates
- Automated testing and validation
- Package deployment automation
- Notification and monitoring
Deployment Integration
Package Deployment:
# Extract to target system
tar -xzf slurm-25.05-software.tar.gz -C /opt/
tar -xzf slurm-25.05-modules.tar.gz -C /usr/share/modules/
# Configure environment
module load slurm/25.05
System Integration:
- Module system compatibility
- Shared filesystem deployment
- Configuration management
- Service orchestration
Error Handling Architecture
Build Error Recovery
Error Categories:
- Container launch failures
- Network connectivity issues
- Compilation errors
- Resource exhaustion
Recovery Strategies:
- Automatic retry with exponential backoff
- Container recreation for recoverable errors
- Partial build state recovery
- Graceful degradation modes
Monitoring and Logging
Logging Strategy:
- Structured logging with timestamps
- Multiple log levels (DEBUG, INFO, WARN, ERROR)
- Component-specific loggers
- Build process tracing
Error Reporting:
- Detailed error messages with context
- Build failure analysis
- Performance metrics collection
- Debug information capture
This architecture enables slurm-factory to provide reliable, scalable, and maintainable HPC software deployments while maintaining flexibility for diverse deployment scenarios.