Designing GPU-Aware Build Runners for ML Pipelines on RISC-V Nodes
Build CI runners that schedule ML jobs onto RISC‑V NVLink domains to maximize throughput and minimize cross‑node penalties.
Your CI is killing ML throughput — especially on heterogeneous RISC-V GPU nodes
ML pipelines are sensitive to GPU topology. Developers and SREs building CI/CD for training and inference face recurring problems: jobs scheduled across GPUs that are slow to talk to each other, inflated epoch times from cross-node penalties, and wasted cost when the scheduler ignores NVLink Fusion domains. With RISC-V server-class silicon entering AI datacenters and NVLink Fusion bridging RISC-V SoCs to NVIDIA GPUs (announced in early 2026), teams must redesign CI runners and schedulers to be GPU- and NVLink-aware — otherwise the performance advantage of these new platforms won’t be realized.
Executive summary — what to do first
If you only do three things this quarter:
- Discover and expose GPU topology (NVLink domains, peer-to-peer reachability, NIC/GPU locality) as node metadata.
- Make CI runners topology-aware so multi-GPU jobs are co-located inside the same NVLink domain or routed over the highest-bandwidth path.
- Measure and iterate with NCCL, DCGM and Prometheus metrics to validate throughput and tune placement policies.
Why this matters in 2026
Late 2025 and early 2026 saw two important shifts: broader adoption of RISC-V in server designs and vendor moves to integrate NVLink into heterogeneous platforms (e.g., SiFive’s NVLink Fusion work). That means datacenter fabrics will increasingly contain RISC-V nodes tightly coupled to NVIDIA accelerators. In practice this lowers CPU-GPU communication overheads — but only if software stacks (CI runners, Kubernetes device plugins, schedulers) understand and exploit NVLink’s topology. Absent that, ML jobs can suffer from cross-node synchronization penalties and unpredictable tail latency.
High-level architecture for NVLink-aware CI runners
Designing a CI runner for ML on RISC-V + NVLink nodes requires five coordinated subsystems:
- Topology discovery — identify NVLink connectivity, PCIe/NIC affinity, and GPU peer groups.
- Node labeling and feature advertising — expose topology as labels/annotations consumable by schedulers and runners.
- Scheduler integration — extend or configure the scheduler to prefer intra-domain placements.
- Runner execution logic — the CI runner (Buildkite agent, GitLab Runner, GH Actions self-hosted, Tekton worker) must request the right resources and set runtime environment variables (NCCL, UCX) to optimize inter-GPU comms.
- Observability & feedback — measure actual bandwidth, latencies, and epoch times to validate placement decisions and drive autoscaling.
Key assumptions and constraints
- RISC-V nodes run a Linux kernel with GPU driver support and a userspace NVIDIA driver stack compatible with the chosen GPUs.
- Kubernetes (or your orchestration system) is the control plane for scheduling; if you use custom runners directly on VMs, the same discovery and placement concepts apply.
- NVLink domains may exist inside a node (typical) and, with NVLink Fusion, across nodes — schedulers must treat both as high-bandwidth locality boundaries.
Step 1 — Discovering NVLink domains and GPU topology
The first operational task is to build reliable topology discovery. Use vendor tools and expose results through a small agent.
Tools and APIs
- nvidia-smi topo -m (where supported) shows peer-to-peer connectivity and NVLink links.
- DCGM (Data Center GPU Manager) provides telemetry and topology if the driver is present.
- NVIDIA Management Library (NVML) via python bindings can enumerate devices and query link information programmatically.
- PCI topology: lspci and sysfs can be used for fallback detection where NVML isn’t available.
Implementing a discovery agent
Run a lightweight agent on each RISC-V node (systemd service or daemonset in k8s) that:
- Queries NVML/DCGM to build a GPU graph (nodes are GPUs, edges are NVLink links).
- Generates a compact description: groups of GPUs that form an NVLink domain, inter-domain bandwidth classes (local, NVLink, RDMA), and NIC affinity.
- Publishes the description as node labels and annotations, or writes it to the cloud provider’s metadata service (a minimal discovery sketch follows the example labels below).
Example label keys (suggested):
k8s.node/bus-topology: "nvlink-domain:0:gpus=0,1;nvlink-domain:1:gpus=2,3"
k8s.node/nvlink-capacity: "domain0=200GBps;inter-node=100GBps"
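Example (Python sketch) of the NVML-based grouping step, assuming the pynvml bindings are installed and the platform’s NVIDIA driver answers NVLink queries; it groups local GPUs into NVLink domains and prints a descriptor in the format suggested above. Treat it as a starting point rather than a hardened agent: NVSwitch-attached ports and cross-node Fusion links need additional handling.

# Group local GPUs into NVLink domains via NVML (assumes pynvml and a driver
# that exposes NVLink queries on this platform).
import pynvml

def discover_nvlink_domains():
    pynvml.nvmlInit()
    try:
        count = pynvml.nvmlDeviceGetCount()
        handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(count)]
        bus_to_idx = {pynvml.nvmlDeviceGetPciInfo(h).busId: i
                      for i, h in enumerate(handles)}

        # Union-find: every GPU starts in its own domain.
        parent = list(range(count))
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        for i, h in enumerate(handles):
            for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
                try:
                    if pynvml.nvmlDeviceGetNvLinkState(h, link) != pynvml.NVML_FEATURE_ENABLED:
                        continue
                    remote = pynvml.nvmlDeviceGetNvLinkRemotePciInfo(h, link)
                    j = bus_to_idx.get(remote.busId)
                    if j is not None:
                        parent[find(i)] = find(j)   # peer is another local GPU: merge domains
                except pynvml.NVMLError:
                    break                           # link index not populated on this GPU

        domains = {}
        for i in range(count):
            domains.setdefault(find(i), []).append(i)
        return list(domains.values())
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    # Prints, e.g.: nvlink-domain:0:gpus=0,1;nvlink-domain:1:gpus=2,3
    groups = discover_nvlink_domains()
    print(";".join(f"nvlink-domain:{d}:gpus={','.join(map(str, g))}"
                   for d, g in enumerate(groups)))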
Step 2 — Advertise topology to the scheduler and runners
Once topology is available, you must make scheduling decisions dependent on it.
Node labels and device plugin coordination
Use the Kubernetes device plugin model (or equivalent) to advertise GPU resources along with topology labels. For multi-cluster or custom orchestrators, expose the same metadata via API endpoints consumed by the CI runner.
Labeling patterns
- Per-domain label: topology.k8s.io/nvlink-domain=domain-0
- Inter-domain bandwidth class: node.k8s.io/nvlink-bandwidth=200GBps
- GPU count: node.k8s.io/gpu-count=4
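Example (Python sketch) of how the discovery agent might publish these labels, assuming the official kubernetes client library, in-cluster service-account credentials with permission to patch nodes, and the node name injected via the downward API; the label and annotation keys mirror the suggestions above and are not a standard.

# Patch topology labels/annotations onto the local node (assumes the
# `kubernetes` Python client and RBAC that allows patching nodes).
import os
from kubernetes import client, config

def publish_topology(domains, inter_node_gbps=None):
    """domains: list of GPU index lists from discovery, e.g. [[0, 1], [2, 3]]."""
    config.load_incluster_config()           # use load_kube_config() outside the cluster
    node_name = os.environ["NODE_NAME"]      # injected via the Pod downward API
    labels = {
        "node.k8s.io/gpu-count": str(sum(len(d) for d in domains)),
    }
    if inter_node_gbps is not None:
        labels["node.k8s.io/nvlink-bandwidth"] = f"{inter_node_gbps}GBps"
    # The full GPU grouping goes into an annotation (label values are size-limited).
    annotations = {"k8s.node/bus-topology": ";".join(
        f"nvlink-domain:{d}:gpus={','.join(map(str, g))}"
        for d, g in enumerate(domains))}
    body = {"metadata": {"labels": labels, "annotations": annotations}}
    client.CoreV1Api().patch_node(node_name, body)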
Step 3 — Make CI runners topology-aware
Your CI runner must translate a job’s resource needs into placement constraints. For ML training jobs that use multiple GPUs, prefer scheduling within a single NVLink domain. For inference micro-batch workloads, it may be better to scatter jobs to maximize parallelism.
Runner-level tactics
- Job classification: the runner should classify a job by communication pattern (model parallel vs data parallel, NCCL ring all-reduce vs parameter server) and use this to pick co-located vs distributed placement (a runner-side sketch follows the pod example below).
- Pod/VM overrides: When launching a pod or VM for a CI job, add nodeSelector/nodeAffinity to prefer certain NVLink domains. Example Kubernetes snippet below.
- Fallback policies: If the preferred NVLink domain is busy, consider backoff and queueing instead of falling back to multi-node placement that will stall training.
- Gang-scheduling: For multi-process jobs, the runner should request all GPUs at once and use a gang-scheduler (Volcano, kube-batch) or a custom coordinator to avoid partial allocation.
Example (Kubernetes pod snippet):
apiVersion: v1
kind: Pod
metadata:
  name: ml-train-1
spec:
  nodeSelector:
    topology.k8s.io/nvlink-domain: "domain-0"
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest   # illustrative image reference
    resources:
      limits:
        nvidia.com/gpu: 4
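Example (Python sketch) of the runner-side decision that produces the nodeSelector above. The job dictionary shape and the free-capacity input are assumptions for illustration; only the label key comes from Step 2.

# Translate CI job metadata into pod placement overrides. classify_job() and
# the job/free_domains structures are hypothetical; the policy mirrors the
# tactics above: co-locate NVLink-sensitive jobs, queue instead of scattering.
from typing import Optional

def classify_job(job: dict) -> str:
    if job.get("gpus", 1) <= 1:
        return "single-gpu"
    if job.get("parallelism") in ("tensor", "model"):
        return "colocate"            # NVLink-sensitive: must share one domain
    return "data-parallel"           # prefers co-location, tolerates queueing

def placement_overrides(job: dict, free_domains: dict) -> Optional[dict]:
    """free_domains maps domain label value -> free GPU count, e.g. {"domain-0": 4}."""
    if classify_job(job) == "single-gpu":
        return {}                    # no topology constraint needed
    candidates = [d for d, free in free_domains.items() if free >= job["gpus"]]
    if not candidates:
        return None                  # caller should queue/backoff, not spill across domains
    chosen = max(candidates, key=lambda d: free_domains[d])   # most headroom
    return {"nodeSelector": {"topology.k8s.io/nvlink-domain": chosen}}

For example, placement_overrides({"gpus": 4, "parallelism": "tensor"}, {"domain-0": 4, "domain-1": 2}) yields the domain-0 selector used in the pod snippet, while a None result signals the fallback policy above: queue rather than degrade.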
Step 4 — Scheduler integration: prefer NVLink domains
There are multiple ways to integrate topology into scheduling:
- Built-in affinity rules — nodeSelector and podAffinity in Kubernetes are simplest.
- Scheduler extender or custom scheduler — implement a scheduler extender to compute placement cost based on NVLink graph and return optimal nodes. Useful for cross-node NVLink Fusion where path selection is non-trivial.
- Topology-aware open-source schedulers — tools like Volcano provide gang-scheduling and priority-aware placement; extend them with NVLink cost models.
Placement cost model
Score candidate placements by expected communication cost. A simple model:
cost = sum(pairwise_comm_volume / bandwidth(path between GPUs))
Where path bandwidth should favor NVLink (intra-node or stitched NVLink Fusion links), then RDMA-capable NIC paths, then standard Ethernet. The scheduler should minimize cost and avoid placements where a single collective spans low-bandwidth links.
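Example (Python sketch) of that cost model. The bandwidth table and data structures are placeholders; a real implementation would read them from the discovery agent’s topology descriptor and the job’s parallelism plan.

# Expected communication time for a candidate placement (lower is better).
# Bandwidth figures (GB/s) are illustrative placeholders, not measurements.
BANDWIDTH_GBPS = {"nvlink": 200.0, "rdma": 50.0, "ethernet": 12.5}

def path_class(gpu_a, gpu_b, topology):
    """topology: {"gpu_domain": {gpu_id: domain_id}, "inter_domain_class": "rdma"}."""
    if topology["gpu_domain"][gpu_a] == topology["gpu_domain"][gpu_b]:
        return "nvlink"              # same domain: intra-node or stitched Fusion link
    return topology.get("inter_domain_class", "ethernet")

def placement_cost(assignment, comm_volume, topology):
    """assignment: rank -> GPU id; comm_volume: (rank_i, rank_j) -> GB exchanged per step."""
    cost = 0.0
    for (ri, rj), volume_gb in comm_volume.items():
        cls = path_class(assignment[ri], assignment[rj], topology)
        cost += volume_gb / BANDWIDTH_GBPS[cls]
    return cost

A scheduler extender would call placement_cost for each candidate node set and return the lowest-cost feasible placement, rejecting any option whose collectives would span the ethernet class.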
Step 5 — Runtime tuning inside the CI runner
Even with perfect placement, runtime environment variables dramatically change performance.
Key runtime knobs
- NCCL: set NCCL_P2P_LEVEL and NCCL_SOCKET_IFNAME, tune NCCL_IB_DISABLE, NCCL_NET_GDR_LEVEL if GPUDirect is available.
- UCX: enable high-speed transports (verbs, dc), tune eager/rndv thresholds.
- CUDA_VISIBLE_DEVICES: map logical device IDs to the GPUs in the NVLink domain to avoid cross-domain communication by accident.
- MIG / GPU partitioning: If GPUs support MIG, request partitions aligned with NVLink groups where appropriate.
Example environment block your runner should inject
env:
- name: NCCL_DEBUG
  value: "INFO"
- name: NCCL_P2P_LEVEL
  value: "NVL"   # or set according to detection
- name: NCCL_SOCKET_IFNAME
  value: "eth0"  # or the RDMA-capable interface
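Example (Python sketch) of how a runner could derive that block from detected placement. The environment variable names are real NCCL knobs; the selection logic, interface names, and defaults are assumptions to validate with nccl-tests on your own hardware.

# Build NCCL env vars from detected topology before launching the training pod.
from typing import Optional

def nccl_env_for_placement(same_nvlink_domain: bool,
                           rdma_ifname: Optional[str],
                           gpudirect_rdma: bool) -> dict:
    env = {"NCCL_DEBUG": "INFO"}
    if same_nvlink_domain:
        env["NCCL_P2P_LEVEL"] = "NVL"          # restrict P2P to NVLink-connected GPUs
    if rdma_ifname:
        env["NCCL_SOCKET_IFNAME"] = rdma_ifname
        env["NCCL_IB_DISABLE"] = "0"           # keep the IB/RoCE transport enabled
        if gpudirect_rdma:
            env["NCCL_NET_GDR_LEVEL"] = "SYS"  # allow GPUDirect RDMA system-wide
    else:
        env["NCCL_SOCKET_IFNAME"] = "eth0"     # plain TCP fallback interface
        env["NCCL_IB_DISABLE"] = "1"
    return env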
Observability — measure and validate
Don’t trust assumptions. Build a measurement pipeline that validates whether NVLink-aware placement reduces wall-clock training time and increases throughput.
Essential metrics
- Inter-GPU bandwidth (from nccl-tests such as all_reduce_perf, plus DCGM counters).
- NCCL ring creation time and collective latencies.
- Epoch time and pre/post-processing overhead.
- Network interface metrics (RDMA verbs, GPUDirect RDMA stats if enabled).
Monitoring stack
Export DCGM metrics to Prometheus and create Grafana dashboards showing per-job bandwidth and topology-aware placement heatmaps. Use those dashboards to feed autoscaling decisions and to refine scheduler cost weights.
Autoscaling and cost control
GPU pools should be tiered:
- High-throughput pool: RISC-V nodes with NVLink Fusion and dense GPU groups; reserved capacity for large multi-GPU training jobs.
- Standard pool: PCIe-attached GPUs for single-GPU tasks and inference micro-batches.
- Preemptible/spot pool: for CI tasks that can be retried; never place critical multi-GPU training across spot instances unless redundancy and checkpointing are robust.
Autoscaler rules should prefer scaling the high-throughput pool when gang-scheduled multi-GPU jobs are queued, and use queue depth plus recent throughput degradation as triggers. Data-staging and placement signals are also useful inputs when designing scaling triggers and warm-pool rules; a minimal trigger is sketched below.
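The sketch below assumes your CI queue and the DCGM/Prometheus pipeline expose the inputs it names; the thresholds are placeholders to tune against your own baselines.

# Decide whether to scale up the high-throughput (NVLink) pool.
def should_scale_nvlink_pool(queued_gang_jobs: int,
                             free_nvlink_domains: int,
                             recent_throughput_ratio: float) -> bool:
    """recent_throughput_ratio: current samples/sec over the rolling baseline; < 1.0 means degradation."""
    backlog = queued_gang_jobs > free_nvlink_domains    # gang-scheduled jobs are waiting
    degraded = recent_throughput_ratio < 0.8            # sustained slowdown (placeholder threshold)
    return backlog or degraded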
Advanced strategies: reduce cross-node penalties
- Prefer model and data sharding that matches NVLink domains: partition model shards onto GPUs inside a domain to minimize inter-domain gradients.
- Use ZeRO and activation checkpointing: offload optimizer states to NVMe or CPU for memory-bound scenarios rather than spreading across slow links.
- Hybrid parallelism: combine intra-domain tensor parallelism (fast NVLink) with inter-domain pipeline stages to reduce synchronization pressure.
- Smart prefetching and caching: stage data on nodes within the NVLink domain to avoid network I/O during critical collectives.
Security and multi-tenancy concerns
With NVLink Fusion exposing cross-node GPU fabrics, multi-tenant isolation becomes more complex. Key mitigations:
- Use hardware isolation features (MIG) where available.
- Enforce strict cgroups and user namespaces for containerized runners.
- Implement admission controllers to ensure high-bandwidth NVLink domains are only assigned to authorized projects (a minimal check is sketched below).
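The check itself can be small. Example (Python sketch) of the core decision of a validating admission webhook, with the HTTP wiring omitted; the allowlist source and label key are illustrative assumptions.

# Deny pods that target an NVLink domain via nodeSelector unless their
# namespace is authorized. AUTHORIZED_NAMESPACES is an illustrative stand-in
# for policy loaded from a ConfigMap or an external system.
NVLINK_LABEL = "topology.k8s.io/nvlink-domain"
AUTHORIZED_NAMESPACES = {"ml-training", "ml-research"}

def review_pod(admission_review: dict) -> dict:
    request = admission_review["request"]
    selector = request["object"].get("spec", {}).get("nodeSelector", {})
    namespace = request.get("namespace", "default")

    wants_nvlink = NVLINK_LABEL in selector
    allowed = (not wants_nvlink) or (namespace in AUTHORIZED_NAMESPACES)
    response = {"uid": request["uid"], "allowed": allowed}
    if not allowed:
        response["status"] = {
            "message": f"namespace {namespace} is not authorized to target {NVLINK_LABEL}"}
    return {"apiVersion": "admission.k8s.io/v1", "kind": "AdmissionReview",
            "response": response}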
Operational checklist before production rollout
- Run discovery agents across all RISC-V nodes and verify labels/annotations.
- Create a synthetic benchmark suite (nccl-tests, ResNet or transformer micro-bench) and collect baseline metrics for different placement classes.
- Implement simple affinity rules that co-locate multi-GPU jobs in an NVLink domain and measure improvements.
- Introduce a scheduler extender if placement decisions require cross-node NVLink path evaluation.
- Add alerting for bandwidth regressions and failed collectives; include remote telemetry probes for field debugging when needed.
Real-world example: GitLab Runner + Kubernetes + NVLink domains
Pattern for production teams using GitLab CI with Kubernetes executor:
- Deploy a discovery daemonset that sets node labels like topology.k8s.io/nvlink-domain.
- Use the NVIDIA device plugin to expose GPU resources.
- Configure GitLab Runner pod templates that add nodeSelector based on CI job metadata (e.g., job.type=training) and inject NCCL/UCX env vars.
- Enable a queue-based fallback: if no domain has sufficient free GPUs, the runner delays the job and notifies the developer with an ETA rather than scheduling a slow multi-node run (a sketch of this policy follows the CI example below).
GitLab CI job example (conceptual):
job: train-model
tags: [riscv, nvlink]
script:
  - run-training.sh
resources:
  gpus: 4
runner: k8s-runner-with-nvlink
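Example (Python sketch) of the queue-based fallback referenced above. free_gpus_per_domain() and notify_developer() are hypothetical helpers; the point is the policy of waiting with an ETA rather than silently scheduling a slow multi-node run.

# Wait for an NVLink domain with enough free GPUs instead of spilling the job
# across low-bandwidth links. Helpers and timings are illustrative.
import time

def acquire_nvlink_domain(job_gpus, free_gpus_per_domain, notify_developer,
                          poll_seconds=60, max_wait_seconds=3600):
    notified = False
    waited = 0
    while waited <= max_wait_seconds:
        free = free_gpus_per_domain()                  # e.g. {"domain-0": 2, "domain-1": 4}
        fits = [d for d, n in free.items() if n >= job_gpus]
        if fits:
            return max(fits, key=lambda d: free[d])    # domain with the most headroom
        if not notified:
            notify_developer(f"No NVLink domain has {job_gpus} free GPUs; job queued, "
                             f"polling every {poll_seconds}s for up to {max_wait_seconds}s.")
            notified = True
        time.sleep(poll_seconds)
        waited += poll_seconds
    raise TimeoutError("No NVLink domain became available within the wait budget.")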
Measuring success — key performance indicators
- Wall-clock time per epoch (target: >20% improvement over non-topology-aware baseline for large models).
- Aggregate NIC and GPU interconnect utilization (target: higher utilization on NVLink paths, lower on Ethernet for collectives).
- Job failure rates due to network/backpressure (target: reduction after placement logic).
- Cost per training job (target: reduce by co-locating and reducing wasted synchronization time).
2026 trends & what’s next
Expect the ecosystem to evolve across three vectors in 2026 and beyond:
- Tooling: more device plugins and scheduler heuristics that natively understand NVLink topology and stitch multi-node NVLink fabrics.
- Cloud offerings: cloud providers and OEMs will introduce RISC-V + NVLink instance types; teams must prepare CI runners to detect instance types and optimize placement.
- Standards: community-driven topology descriptors (node annotations for interconnect graphs) will emerge so orchestrators and runners can interoperate.
SiFive’s work on integrating NVIDIA’s NVLink Fusion with RISC-V platforms is an example of how hardware shifts are creating opportunities — but only software-aware orchestration will unlock the latency and throughput gains.
Common pitfalls and how to avoid them
- Label sprawl: avoid hundreds of custom labels; use a compact topology descriptor instead of per-GPU labels.
- Over-aggressive fallback: don’t allow the runner to silently place multi-GPU jobs across low-bandwidth links — fail fast or queue instead.
- Ignoring telemetry: make observability the feedback loop that tunes scheduler weights rather than hard-coding assumptions.
- Security gaps: ensure admission control and role-based access prevent unauthorized NVLink domain usage.
Actionable takeaways — immediate next steps for engineering teams
- Deploy a topology discovery daemon across a small pilot of RISC-V GPU nodes. Collect NVML/DCGM outputs and export as node metadata.
- Implement nodeSelector-based placement in your CI runner for multi-GPU jobs and measure epoch time improvements on a synthetic benchmark.
- Instrument with DCGM → Prometheus and create a dashboard showing NVLink vs Ethernet utilization per job.
- Prototype a scheduler extender that scores nodes by NVLink path bandwidth; test it in a staging cluster before production rollout.
Final thoughts and call-to-action
RISC-V + NVLink is a real shift — it changes where you place workloads, how you tune runtimes, and what observability you require. The largest gains don’t come from hardware alone but from CI and orchestration that understand topology and prioritize intra-domain placements. If you’re running ML at scale, redesigning your CI runners to be GPU-aware is no longer optional; it’s how you preserve throughput and control cost in mixed-architecture datacenters.
Ready to prototype? Start with a 3-node RISC-V NVLink domain pilot: deploy the discovery agent, add node labels, run nccl-tests, and change your runner to prefer co-located placements. If you want, wecloud.pro offers a ready audit checklist and deployment templates to accelerate the pilot — reach out to get the template repo and a 90-minute architecture session.