GitHub Actions CI/CD
Slurm Factory uses GitHub Actions for continuous integration, testing, and automated builds. The workflows maintain the public buildcache, publish releases, and ensure code quality.
Overview
The CI/CD system consists of:
- Slurm Builds - Build and publish Slurm packages with dependencies for each OS toolchain
- Tarball Publishing - Create relocatable tarballs and upload to S3
- Continuous Integration - Run tests, linters, and type checking
- Documentation - Auto-deploy documentation to GitHub Pages
Workflows
1. Compiler Buildcache (build-and-publish-compiler-buildcache.yml)
Purpose: Build GCC compiler toolchains and publish to Spack buildcache
Trigger: Manual workflow dispatch
Configuration:
name: Build and Publish Compiler Buildcache to S3
on:
  workflow_dispatch:
    inputs:
      compiler_versions:
        description: 'Compiler versions (comma-separated or "all")'
        required: true
        default: 'all'
        type: string
env:
  S3_BUCKET: slurm-factory-spack-buildcache-4b670
  CLOUDFRONT_URL: https://slurm-factory-spack-binary-cache.vantagecompute.ai
Matrix Strategy:
- Builds all requested compiler versions in parallel
- Each build is independent and can succeed/fail separately
- Uses self-hosted runners for performance
- 6-hour timeout per compiler build
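The compiler_versions input accepts either a comma-separated list or the keyword "all". The following is a minimal sketch of how such an input could be expanded into matrix entries. The `SUPPORTED_COMPILERS` list and the `expand_versions` helper are hypothetical names introduced for illustration, and the version numbers other than 13.4.0 are placeholders; the workflow's actual parsing may differ.

```python
# Hypothetical list of supported compiler versions (placeholder values).
SUPPORTED_COMPILERS = ["13.4.0", "14.2.0", "15.1.0"]


def expand_versions(raw: str, supported: list[str]) -> list[str]:
    """Expand an input such as "all" or "13.4.0, 14.2.0" into a version list."""
    if raw.strip().lower() == "all":
        return list(supported)
    # Split on commas and drop empty entries and surrounding whitespace.
    requested = [v.strip() for v in raw.split(",") if v.strip()]
    unknown = [v for v in requested if v not in supported]
    if unknown:
        raise ValueError(f"Unsupported compiler versions: {unknown}")
    return requested
```

Rejecting unknown versions up front is a design choice in this sketch: it fails the run before any multi-hour build is scheduled.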
Key Steps:
- Checkout: Get latest code from repository
- Setup Python: Install the Python version specified in pyproject.toml
- Install uv: Fast Python package manager
- Clean: Remove previous build artifacts
- AWS OIDC: Authenticate with AWS using GitHub OIDC
- Validate GPG: Ensure signing keys are configured
- Build Compiler: Run slurm-factory build-compiler --publish
- Sign Packages: GPG sign all buildcache packages
- Upload to S3: Sync buildcache to S3 bucket
- Test: Install compiler from buildcache and verify
Outputs:
- Compiler buildcache at: s3://slurm-factory-spack-buildcache-4b670/compilers/{version}/buildcache
- Accessible via: https://slurm-factory-spack-binary-cache.vantagecompute.ai/compilers/{version}/
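As an illustration of the output layout above, the two locations for a given compiler version can be derived from the S3_BUCKET and CLOUDFRONT_URL values. This is a sketch only; `compiler_buildcache_locations` is a hypothetical helper, not part of slurm-factory.

```python
S3_BUCKET = "slurm-factory-spack-buildcache-4b670"
CLOUDFRONT_URL = "https://slurm-factory-spack-binary-cache.vantagecompute.ai"


def compiler_buildcache_locations(version: str) -> dict[str, str]:
    """Return the S3 path and public CDN URL for one compiler version."""
    return {
        "s3": f"s3://{S3_BUCKET}/compilers/{version}/buildcache",
        "cdn": f"{CLOUDFRONT_URL}/compilers/{version}/",
    }


locations = compiler_buildcache_locations("13.4.0")
print(locations["s3"])
print(locations["cdn"])
```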
2. Slurm Dependencies Buildcache (build-and-publish-slurm-all.yml)
Purpose: Build Slurm packages with all dependencies for all compiler combinations
Trigger: Manual workflow dispatch
Configuration:
name: Build and Publish Slurm Dependencies for All Compilers
on:
  workflow_dispatch:
    inputs:
      slurm_versions:
        description: 'Slurm versions (comma-separated or "all")'
        required: true
        default: 'all'
      compiler_versions:
        description: 'Compiler versions (comma-separated or "all")'
        required: true
        default: 'all'
Matrix Strategy:
- Cartesian product: Slurm versions × Compiler versions
- Default: 4 Slurm versions × 8 compilers = 32 parallel builds
- Each combination runs independently
- 8-hour timeout per build (GPU builds can be slow)
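The Cartesian-product matrix described above can be sketched with itertools.product. The version labels below are placeholders chosen only to give 4 Slurm entries and 8 compiler entries; the real lists are defined by the workflow.

```python
from itertools import product

# Placeholder labels: 4 Slurm versions and 8 compiler versions.
slurm_versions = [f"slurm-{i}" for i in range(1, 5)]
compiler_versions = [f"gcc-{i}" for i in range(1, 9)]

# One matrix entry (one independent build job) per combination.
matrix = [
    {"slurm_version": s, "compiler_version": c}
    for s, c in product(slurm_versions, compiler_versions)
]

print(len(matrix))  # 32
```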
Build Options:
- --gpu: Include CUDA/ROCm support (~180 packages)
- --publish=all: Publish all dependencies to buildcache
- --no-cache: Force fresh Docker build
Outputs:
- Buildcache at: s3://slurm-factory-spack-buildcache-4b670/slurm/{slurm_version}/{compiler_version}/buildcache
- Includes: Slurm, OpenMPI, PMIx, Munge, OpenSSL, HDF5, CUDA, and all dependencies
3. Slurm Tarball Publishing (build-and-publish-slurm-tarball.yml)
Purpose: Build complete Slurm tarballs and publish to public S3 bucket
Trigger: Manual workflow dispatch
Configuration:
name: Build and Publish Slurm Tarballs to S3
on:
  workflow_dispatch:
    inputs:
      slurm_version:
        description: 'Slurm version to build'
        required: true
        default: '25.11'
      gpu_support:
        description: 'Enable GPU support'
        required: false
        default: true
        type: boolean
Outputs:
- Tarball: slurm-{version}-{toolchain}-software.tar.gz
- Location: s3://vantage-public-assets/slurm-factory/{version}/{toolchain}/
- Public URL: https://vantage-public-assets.s3.amazonaws.com/slurm-factory/{version}/{toolchain}/slurm-{version}-{toolchain}-software.tar.gz
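To make the naming scheme concrete, here is a small sketch that builds the tarball file name and public URL from a version and toolchain. The helper names are hypothetical, and "example-toolchain" is a placeholder rather than a real toolchain identifier.

```python
PUBLIC_BASE_URL = "https://vantage-public-assets.s3.amazonaws.com/slurm-factory"


def tarball_name(version: str, toolchain: str) -> str:
    """File name following the slurm-{version}-{toolchain}-software.tar.gz pattern."""
    return f"slurm-{version}-{toolchain}-software.tar.gz"


def tarball_url(version: str, toolchain: str) -> str:
    """Public download URL for a published tarball."""
    return f"{PUBLIC_BASE_URL}/{version}/{toolchain}/{tarball_name(version, toolchain)}"


# "example-toolchain" is a placeholder value for illustration.
print(tarball_url("25.11", "example-toolchain"))
```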
4. CI Tests (ci.yml)
Purpose: Run linters, type checking, and unit tests on every pull request
Trigger: Pull requests to main branch
Jobs:
- commitlint: Validate commit messages follow Conventional Commits
- ci-tests: Run linters, type checker, and unit tests
Steps:
- Checkout code
- Install just (task runner)
- Install uv (package manager)
- Run linters (ruff, codespell)
- Run type checker (pyright)
- Run unit tests (pytest)
Requirements:
- All tests must pass before PR can be merged
- Commit messages must follow the type(scope): message format
- Code coverage threshold: 80%
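The commit message requirement can be approximated locally before pushing. The sketch below is a simplified check, not the commitlint configuration used by the repository: the regular expression and the list of allowed types are assumptions based on common Conventional Commits usage.

```python
import re

# Simplified pattern: type, optional "(scope)", optional "!", then ": " and a message.
# The allowed types are an assumption; the repository's commitlint rules may differ.
COMMIT_RE = re.compile(
    r"^(feat|fix|docs|chore|ci|refactor|test|perf|build|style|revert)"
    r"(\([\w./-]+\))?!?: \S.*$"
)


def is_conventional(subject: str) -> bool:
    """Return True if the commit subject line matches the simplified pattern."""
    return COMMIT_RE.match(subject) is not None


for subject in ["feat(build): add GPU flag", "fix: fix bug", "Updated stuff"]:
    print(subject, "->", is_conventional(subject))
```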
5. Documentation Deploy (update-docs.yml)
Purpose: Build and deploy Docusaurus documentation to GitHub Pages
Trigger: Push to main branch or manual dispatch
Steps:
- Checkout repository with full history
- Setup Node.js LTS
- Install Docusaurus dependencies
- Build documentation
- Deploy to GitHub Pages
Output: https://vantagecompute.github.io/slurm-factory
Build Process Diagrams
[Diagram omitted: Compiler Build Process]
[Diagram omitted: Slurm Build Process]
Self-Hosted Runners
The build workflows use self-hosted runners for:
- Performance: Direct access to high-performance hardware
- Cost: No GitHub Actions minutes consumed
- Docker: Pre-installed Docker for builds
- Storage: Large disk space for build artifacts and caches
- Network: Fast network for S3 uploads
Runner Specifications:
- CPU: 16+ cores
- RAM: 32+ GB
- Disk: 500+ GB SSD
- OS: Ubuntu 24.04 LTS
- Docker: 24.0+
Secrets and Variables
Repository Secrets
| Secret | Purpose | Used By |
|---|---|---|
| AWS_ROLE_ARN | GitHub Actions IAM role ARN | All build workflows |
| GPG_PRIVATE_KEY | GPG private key for signing | Compiler & Slurm builds |
| GPG_KEY_ID | GPG key ID | Compiler & Slurm builds |
Environment Variables
| Variable | Value | Purpose |
|---|---|---|
| S3_BUCKET | slurm-factory-spack-buildcache-4b670 | Buildcache S3 bucket |
| CLOUDFRONT_URL | https://slurm-factory-spack-binary-cache.vantagecompute.ai | Public CDN URL |
Monitoring and Notifications
Workflow Status
Monitor workflow runs at: https://github.com/vantagecompute/slurm-factory/actions
Notifications
Failed workflows generate:
- GitHub status checks on PRs
- Email notifications to repository admins
- Slack notifications (if configured)
Build Logs
All build logs are retained for:
- 90 days for workflow runs
- Indefinitely for releases
Troubleshooting
Workflow Failures
Check the workflow run logs for:
- Build errors: Look for compilation failures
- Upload errors: Check AWS credentials and permissions
- Test failures: Review test output for issues
Re-running Failed Jobs
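# Trigger a new run for a single compiler version
gh workflow run build-and-publish-compiler-buildcache.yml \
  -f compiler_versions="13.4.0"
# Re-run only the failed jobs of an existing run
gh run rerun <run-id> --failed
# Re-run the entire workflow run
gh run rerun <run-id>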
Debugging Locally
Test workflows locally with act:
# Install act
brew install act # macOS
# or
sudo snap install act # Linux
# Run CI workflow locally
act pull_request
Best Practices
Commit Messages
Follow Conventional Commits:
feat: add new feature
fix: fix bug
docs: update documentation
chore: update dependencies
ci: update CI workflow
Pull Requests
- Create feature branch
- Make changes
- Run just lint and just unit
- Push and create PR
- Wait for CI to pass
- Request review
- Merge after approval
Release Process
- Update version in pyproject.toml
- Update CHANGELOG.md
- Create git tag: git tag v1.0.0
- Push tag: git push origin v1.0.0
- GitHub Actions will create release
- Publish to PyPI: uv build && uv publish
See Also
- Infrastructure - AWS infrastructure details
- Slurm Factory Spack Build Cache - Using the buildcache
- Contributing - Development guide
- Architecture - Build system overview