Inference Guide
Ready-to-deploy recipes for validated open-weight LLMs on Alauda AI. Each model
in this guide has been deployed end-to-end on a real cluster and benchmarked, so you
get a known-good ServingRuntime + InferenceService pair plus the throughput you
can expect per replica.
The models here were validated on Huawei Ascend 910B NPUs with the community
vLLM-Ascend and Huawei MindIE engines. For the runtime model (KServe
ServingRuntime / InferenceService, ModelCar storage, scheduling) see
Model Deployment & Inference; for adding your own
engine see Extend Inference Runtimes.
Validated models
Models listed here meet the target serving SLO: at rate=1, chatbot workload, inter-token latency (ITL) P90 ≈ 30 ms. Qwen3-30B-A3B (MoE, 30B total / 3B active, BF16) was validated on Ascend 910B3 at TP=2 and TP=4 with both vLLM-Ascend and MindIE.
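To make the SLO metric concrete, the following is a minimal sketch of how a P90 inter-token latency can be derived from per-token latency samples. It assumes the nearest-rank percentile method, which may differ from the method guidellm uses internally. The function name and the sample values are hypothetical and are not measurements from this guide.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method.

    Assumption: nearest-rank is used here for simplicity; benchmarking tools
    may interpolate between samples instead.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers pct percent of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical inter-token latencies in milliseconds (illustrative only).
itl_ms = [22.0, 24.5, 25.1, 26.0, 27.3, 28.0, 28.4, 29.0, 29.8, 41.2]
itl_p90 = percentile_nearest_rank(itl_ms, 90)
print(f"ITL P90 = {itl_p90} ms")
```

With P90, a single slow token such as the last sample above does not move the metric, whereas a maximum or a mean would be affected by it.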
Runtime images
CUDA / NVIDIA images are x86-64; the Ascend CANN images are arm64. Always match the runtime image's CANN version to the host NPU driver on your nodes.
Benchmark methodology
All numbers in this guide come from real guidellm runs, not estimates:
- Open-loop, per-replica. guidellm benchmark --rate-type constant --rate 1..9 --max-seconds 300 --max-error-rate 0.5, measured at replica=1.
- Four fixed workloads (prompt/output tokens): Chat 512/256, Code 1024/1024, RAG 4096/512, Long RAG 10240/1536.
- Saturation capacity = the peak achieved RPS across the rate 1→9 sweep.
- A workload whose capacity is < 1 RPS/replica is reported honestly as such (it cannot sustain even one request per second per replica on this hardware/stack).
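The saturation-capacity rule above can be sketched as follows. This is an illustration under stated assumptions, not the tooling used for this guide: the function names, the report wording, and the sweep values are all hypothetical, and the input is assumed to be a mapping from requested rate to achieved RPS that you have already extracted from the benchmark output.

```python
def saturation_capacity(achieved_rps_by_rate):
    """Return the peak achieved RPS across a constant-rate sweep.

    achieved_rps_by_rate maps each requested rate (1..9) to the RPS the
    replica actually sustained at that rate.
    """
    if not achieved_rps_by_rate:
        raise ValueError("sweep results must not be empty")
    return max(achieved_rps_by_rate.values())


def describe_capacity(capacity):
    """Format a capacity figure, flagging workloads below 1 RPS per replica."""
    if capacity < 1.0:
        return f"< 1 RPS/replica (peak {capacity:.2f})"
    return f"{capacity:.2f} RPS/replica"


# Hypothetical sweep results (requested rate -> achieved RPS), illustrative only.
sweep = {1: 0.99, 2: 1.97, 3: 2.85, 4: 3.10, 5: 3.05,
         6: 3.02, 7: 2.98, 8: 2.95, 9: 2.90}
capacity = saturation_capacity(sweep)
print(describe_capacity(capacity))
```

Taking the maximum rather than the value at the highest requested rate matters because achieved RPS can drop once the replica is pushed past saturation.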
Deploy a validated model
Each model page links self-contained YAMLs under assets/ that bundle a
namespace-scoped ServingRuntime and the matching InferenceService.
Caveats
- These are namespace-scoped ServingRuntime examples (not ClusterServingRuntime), one per model/engine/TP combination. Apply them in the same namespace as the InferenceService that references them.
- Resource keys are for Ascend 910B3 (huawei.com/Ascend910B3). Adjust the resource key, image, and version fields for your actual NPU model.
- For Ascend single-node multi-card (TP>1) the runtime image must support HCCL init under the configured Modelcar UID; see the Modelcar permission modes in Extend Inference Runtimes.