Inference Guide

Ready-to-deploy recipes for validated open-weight LLMs on Alauda AI. Each model in this guide has been deployed end-to-end on a real cluster and benchmarked, so you get a known-good ServingRuntime + InferenceService pair plus the throughput you can expect per replica.

The models here were validated on Huawei Ascend 910B NPUs with the community vLLM-Ascend and Huawei MindIE engines. For the runtime model (KServe ServingRuntime / InferenceService, ModelCar storage, scheduling), see Model Deployment & Inference; to add your own engine, see Extend Inference Runtimes.

Validated models

| Model | Type | Params | dtype | Device | Engines | Model page |
| Qwen3-30B-A3B | MoE (3B active) | 30B | BF16 | Ascend 910B3 ×2–4 | vLLM-Ascend / MindIE | Qwen3-30B-A3B |

Models listed here meet the target serving SLO: on the Chat workload at a constant rate of 1 request/s, inter-token latency (ITL) P90 ≈ 30 ms. Qwen3-30B-A3B (MoE, 30B total / 3B active, BF16) was validated on Ascend 910B3 at TP=2 and TP=4 with both vLLM-Ascend and MindIE.

Runtime images

| Engine | Device | Image (validated tag) | Notes |
| vLLM-Ascend | Huawei Ascend NPU | quay.io/ascend/vllm-ascend:v0.18.0-openeuler | Community vLLM for Ascend, V1 engine. Pick the tag whose CANN version matches your host driver. |
| MindIE | Huawei Ascend NPU | swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.2.RC1-800I-A2-py311-openeuler24.03-lts | Huawei engine (ATB backend). Picks BF16 automatically from the weights. Requires root + writable model volume. |
| vLLM (NVIDIA) | NVIDIA GPU (CUDA) | Alauda AI built-in vLLM runtime | The platform's default GPU runtime. The models in this guide were validated on Ascend NPU; GPU validation at this size is tracked separately. |
TIP

CUDA / NVIDIA images are x86-64; the Ascend CANN images are arm64. Always match the runtime image's CANN version to the host NPU driver on your nodes.
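
Before picking an image tag, it can help to confirm what the nodes actually report. The sketch below is one possible check, not a platform-provided tool: `node_inventory` is a hypothetical helper name, the resource key `huawei.com/Ascend910B3` is taken from this guide and may differ on your cluster, and the escaped-dot column path assumes your kubectl version accepts that JSONPath form.

```shell
# List each node's CPU architecture and its allocatable Ascend 910B3 count.
# Assumption: the device plugin advertises huawei.com/Ascend910B3 (as in this guide).
node_inventory() {
  kubectl get nodes -o custom-columns='NAME:.metadata.name,ARCH:.status.nodeInfo.architecture,NPU:.status.allocatable.huawei\.com/Ascend910B3'
}

# On the NPU host itself, the installed driver can be inspected with:
#   npu-smi info
# Compare what it reports against the CANN version the image tag was built for.
```

Nodes whose ARCH column does not match the image architecture, or whose NPU column shows `<none>`, are not valid scheduling targets for these runtimes.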

Benchmark methodology

All numbers in this guide come from real guidellm runs, not estimates:

  • Open-loop, per-replica. guidellm benchmark --rate-type constant --max-seconds 300 --max-error-rate 0.5, run once per --rate value from 1 to 9 and measured at replica=1.
  • Four fixed workloads (prompt/output tokens): Chat 512/256, Code 1024/1024, RAG 4096/512, Long RAG 10240/1536.
  • Saturation capacity = the peak achieved RPS across the rate 1→9 sweep.
  • A workload whose capacity is < 1 RPS/replica is reported as such: it cannot sustain even one request per second per replica on this hardware/stack.
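
The sweep and the capacity rule above can be sketched in shell. This is an illustration under stated assumptions rather than the exact harness used for this guide: `TARGET`, `workload_data`, `sweep`, and `capacity` are hypothetical names, and the `--target`, `--data`, and `--output-path` flags assume a guidellm release that still accepts `--rate-type`. The script defaults to a dry run that only prints the commands.

```shell
TARGET="${TARGET:-http://localhost:8080}"  # hypothetical endpoint of one replica
DRY_RUN="${DRY_RUN:-1}"                    # 1 = print commands, 0 = run them

# Map a workload name to the prompt/output token counts listed above.
workload_data() {
  case "$1" in
    chat)     echo "prompt_tokens=512,output_tokens=256" ;;
    code)     echo "prompt_tokens=1024,output_tokens=1024" ;;
    rag)      echo "prompt_tokens=4096,output_tokens=512" ;;
    long-rag) echo "prompt_tokens=10240,output_tokens=1536" ;;
    *)        echo "unknown workload: $1" >&2; return 1 ;;
  esac
}

# Run (or print) one constant-rate benchmark per rate from 1 to 9.
# The body is a subshell so its variables do not leak into the caller.
sweep() (
  workload="$1"
  data="$(workload_data "$workload")" || exit 1
  for rate in 1 2 3 4 5 6 7 8 9; do
    set -- guidellm benchmark --target "$TARGET" \
      --rate-type constant --rate "$rate" \
      --max-seconds 300 --max-error-rate 0.5 \
      --data "$data" --output-path "${workload}-rate${rate}.json"
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
  done
)

# Saturation capacity: peak achieved RPS over "rate achieved_rps" lines on stdin.
capacity() {
  awk 'NF >= 2 && ($2 + 0 > max) { max = $2 + 0 } END { printf "%.2f\n", max + 0 }'
}
```

Running `sweep chat` prints the nine commands for the Chat workload. Once the achieved RPS for each rate has been collected into two-column lines, piping them through `capacity` gives the peak; a result below 1.00 corresponds to the "< 1 RPS/replica" case.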

Deploy a validated model

Each model page links self-contained YAMLs under assets/ that bundle a namespace-scoped ServingRuntime and the matching InferenceService.

base=https://raw.githubusercontent.com/alauda/aml-docs/master/docs/en/inference_guide/assets

# 1. Download the file and edit it: set metadata.namespace, the image tag (match
#    your CANN / host driver), and storageUri (a PVC of the weights, or an
#    oci:// ModelCar).
curl -fsSLO "$base/qwen3-30b-a3b/qwen3-30b-a3b-mindie-tp2.yaml"

# 2. Apply the edited ServingRuntime + InferenceService.
kubectl apply -f qwen3-30b-a3b-mindie-tp2.yaml

# 3. Watch it come up (first start loads weights and compiles graphs; this can take minutes).
kubectl -n <your-namespace> get inferenceservice -w

# 4. Call the OpenAI-compatible endpoint.
curl -s http://<isvc-url>/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"qwen3-30b-a3b","messages":[{"role":"user","content":"hello"}]}'
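
Steps 3 and 4 can also be scripted. The following is a minimal sketch, assuming the InferenceService is named `qwen3-30b-a3b` and lives in a namespace called `my-namespace` (both placeholders), and that the served model name matches the InferenceService name. The helper names are hypothetical, and the address in `.status.url` may only be reachable from inside the cluster, depending on how ingress is configured.

```shell
NS="${NS:-my-namespace}"       # placeholder namespace
ISVC="${ISVC:-qwen3-30b-a3b}"  # placeholder InferenceService name

# Block until the InferenceService reports Ready, or give up after the timeout.
wait_ready() {
  kubectl -n "$NS" wait --for=condition=Ready "inferenceservice/$ISVC" --timeout="${1:-20m}"
}

# Print the URL that KServe publishes in the resource status.
isvc_url() {
  kubectl -n "$NS" get inferenceservice "$ISVC" -o jsonpath='{.status.url}'
}

# Send one short chat request to the OpenAI-compatible endpoint.
smoke_test() {
  curl -fsS "$(isvc_url)/v1/chat/completions" \
    -H 'Content-Type: application/json' \
    -d "{\"model\":\"$ISVC\",\"messages\":[{\"role\":\"user\",\"content\":\"hello\"}],\"max_tokens\":16}"
}
```

A typical sequence is `wait_ready && smoke_test`. A longer timeout (for example `wait_ready 40m`) may be needed on the first start, when weights are loaded and graphs are compiled.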

Caveats

  • These are namespace-scoped ServingRuntime examples (not ClusterServingRuntime), one per model/engine/TP combination. Apply each one in the same namespace as the InferenceService that references it.
  • Resource keys are for Ascend 910B3 (huawei.com/Ascend910B3). Adjust the resource key, image, and version fields for your actual NPU model.
  • For Ascend single-node multi-card (TP>1) deployments, the runtime image must support HCCL init under the configured ModelCar UID; see the ModelCar permission modes in Extend Inference Runtimes.
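
If your nodes advertise a different resource key, the downloaded manifest can be adjusted before applying it. The sketch below is one way to do that with sed; `rekey` is a hypothetical helper, and the replacement key shown in the usage note is a placeholder, not a key this guide has validated.

```shell
# Rewrite the Ascend 910B3 resource key in a manifest read from stdin.
# $1 = the key your device plugin actually advertises (must not contain '#' or '&').
rekey() {
  sed "s#huawei\.com/Ascend910B3#$1#g"
}

# To see which keys a node advertises (replace <node> with a real node name):
#   kubectl get node <node> -o jsonpath='{.status.allocatable}'
```

For example, `rekey <your-resource-key> < qwen3-30b-a3b-mindie-tp2.yaml > edited.yaml` writes a copy with every occurrence replaced. The image and version fields still need to be edited by hand.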