Overview
Kubernetes provides robust orchestration for SGLang deployments, enabling auto-scaling, self-healing, and declarative configuration. This guide covers single-node and distributed multi-node deployments.
Prerequisites
- Kubernetes cluster version ≥1.26
- NVIDIA GPU Operator or device plugin installed
- kubectl configured for cluster access
- Storage class for persistent volumes (for model caching)
GPU Support Setup
Install the NVIDIA device plugin if it is not already available:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
Verify GPU nodes:
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
Single-Node Deployment
Basic Deployment
Deploy a single-replica SGLang server:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llama-31-8b-sglang
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 30Gi
  storageClassName: default # change this to your preferred storage class
  volumeMode: Filesystem
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: meta-llama-31-8b-instruct-sglang
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: meta-llama-31-8b-instruct-sglang
  template:
    metadata:
      labels:
        app: meta-llama-31-8b-instruct-sglang
        model: meta-llama-31-8b-instruct
        engine: sglang
    spec:
      restartPolicy: Always
      runtimeClassName: nvidia
      containers:
        - name: meta-llama-31-8b-instruct-sglang
          image: docker.io/lmsysorg/sglang:latest
          imagePullPolicy: Always # IfNotPresent or Never
          ports:
            - containerPort: 30000
          command: ["python3", "-m", "sglang.launch_server"]
          args:
            [
              "--model-path",
              "meta-llama/Llama-3.1-8B-Instruct",
              "--host",
              "0.0.0.0",
              "--port",
              "30000",
            ]
          env:
            - name: HF_TOKEN
              value: <secret>
          resources:
            limits:
              nvidia.com/gpu: 1
              cpu: 8
              memory: 40Gi
            requests:
              cpu: 2
              memory: 16Gi
              nvidia.com/gpu: 1
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
            - name: hf-cache
              mountPath: /root/.cache/huggingface
            - name: localtime
              mountPath: /etc/localtime
              readOnly: true
          livenessProbe:
            httpGet:
              path: /health
              port: 30000
            initialDelaySeconds: 120
            periodSeconds: 15
            timeoutSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health_generate
              port: 30000
            initialDelaySeconds: 120
            periodSeconds: 15
            timeoutSeconds: 10
            failureThreshold: 3
            successThreshold: 1
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 10Gi
        - name: hf-cache
          persistentVolumeClaim:
            claimName: llama-31-8b-sglang
        - name: localtime
          hostPath:
            path: /etc/localtime
            type: File
---
apiVersion: v1
kind: Service
metadata:
  name: meta-llama-31-8b-instruct-sglang
spec:
  selector:
    app: meta-llama-31-8b-instruct-sglang
  ports:
    - protocol: TCP
      port: 80 # port exposed by the Service
      targetPort: 30000 # port in container
  type: LoadBalancer # change to ClusterIP if needed
Save as sglang-deployment.yaml and apply:
kubectl apply -f sglang-deployment.yaml
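The manifest above passes HF_TOKEN as a literal value. A common alternative is to keep the token in a Kubernetes Secret and reference it from the container. The following is a minimal sketch, not part of the original manifest; the Secret name hf-token-secret and the key token are hypothetical and can be anything you choose.

```yaml
# Hypothetical Secret holding the Hugging Face token.
# The name "hf-token-secret" and the key "token" are illustrative assumptions.
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
stringData:
  token: <secret>
```

The container in the Deployment would then read the token through secretKeyRef instead of an inline value:

```yaml
# Fragment of the container spec (replaces the inline "value: <secret>").
env:
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        name: hf-token-secret # must match the Secret name above
        key: token
```

This keeps the token out of the Deployment manifest, which is usually the file that ends up in version control.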
Verify Deployment
# Check pod status
kubectl get pods -l app=meta-llama-31-8b-instruct-sglang
# View logs
kubectl logs -f deployment/meta-llama-31-8b-instruct-sglang
# Check service endpoint
kubectl get svc meta-llama-31-8b-instruct-sglang
Multi-Node Distributed Deployment
Using StatefulSet
For tensor parallelism across multiple nodes (here 2 pods with 8 GPUs each, matching --tensor-parallel-size 16):
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: distributed-sglang
spec:
  replicas: 2 # number of nodes/pods to run distributed sglang
  selector:
    matchLabels:
      app: distributed-sglang
  serviceName: ""
  template:
    metadata:
      labels:
        app: distributed-sglang
    spec:
      containers:
        - name: sglang-container
          image: docker.io/lmsysorg/sglang:latest
          imagePullPolicy: Always
          command:
            - /bin/bash
            - -c
          args:
            - |
              python3 -m sglang.launch_server \
                --model /llm-folder \
                --dist-init-addr sglang-master-pod:5000 \
                --tensor-parallel-size 16 \
                --nnodes 2 \
                --node-rank $POD_INDEX \
                --trust-remote-code \
                --host 0.0.0.0 \
                --port 8000 \
                --enable-metrics \
                --expert-parallel-size 16
          env:
            - name: POD_INDEX # reflects the node-rank
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.labels['apps.kubernetes.io/pod-index']
            - name: NCCL_DEBUG
              value: INFO
          resources:
            limits:
              nvidia.com/gpu: "8"
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
            - mountPath: /llm-folder
              name: llm
          securityContext:
            privileged: true # to leverage RDMA/InfiniBand device
      hostNetwork: true
      volumes:
        - emptyDir:
            medium: Memory
            sizeLimit: 10Gi
          name: dshm
        - hostPath:
            path: /llm-folder # replace with PVC or hostPath with your model weights
            type: DirectoryOrCreate
          name: llm
---
apiVersion: v1
kind: Service
metadata:
  name: sglang-master-pod
spec:
  type: ClusterIP
  selector:
    app: distributed-sglang
    apps.kubernetes.io/pod-index: "0"
  ports:
    - name: dist-port
      port: 5000
      targetPort: 5000
---
apiVersion: v1
kind: Service
metadata:
  name: sglang-serving-on-master
spec:
  type: NodePort
  selector:
    app: distributed-sglang
    apps.kubernetes.io/pod-index: "0"
  ports:
    - name: serving
      port: 8000
      targetPort: 8000
    - name: metrics
      port: 8080
      targetPort: 8080
Apply the configuration:
kubectl apply -f distributed-sglang.yaml
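The launch command in the StatefulSet reaches rank 0 through the Service name sglang-master-pod while the pods run with hostNetwork: true. According to the Kubernetes DNS documentation, host-network pods only use cluster DNS when dnsPolicy is set to ClusterFirstWithHostNet, which the LeaderWorkerSet example later in this guide already does. If the worker pod cannot resolve the Service name, a setting along these lines may help. This is a sketch under that assumption, not a change taken from the original manifest.

```yaml
# Fragment of the StatefulSet pod template (spec.template.spec).
# Assumption: the cluster DNS must resolve "sglang-master-pod" for host-network pods.
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet # use cluster DNS even though the pod shares the node network
```

Note also that POD_INDEX depends on the apps.kubernetes.io/pod-index label, which the Kubernetes StatefulSet documentation describes as available from v1.28 onward, so older clusters may need a different way to derive the node rank.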
LeaderWorkerSet (LWS) Deployment
LeaderWorkerSet is the recommended approach for multi-node distributed inference.
Prerequisites
Install the LeaderWorkerSet controller:
kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/download/v0.6.0/manifests.yaml
Verify installation:
kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io
Basic LWS Configuration
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: sglang
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        dnsPolicy: ClusterFirstWithHostNet
        hostNetwork: true
        hostIPC: true
        containers:
          - name: sglang-leader
            image: lmsysorg/sglang:latest
            securityContext:
              privileged: true
            env:
              - name: NCCL_IB_GID_INDEX
                value: "3"
            command:
              - python3
              - -m
              - sglang.launch_server
              - --model-path
              - /work/models
              - --tp
              - "16"
              - --dist-init-addr
              - $(LWS_LEADER_ADDRESS):20000
              - --nnodes
              - $(LWS_GROUP_SIZE)
              - --node-rank
              - $(LWS_WORKER_INDEX)
              - --trust-remote-code
              - --host
              - "0.0.0.0"
              - --port
              - "40000"
            resources:
              limits:
                nvidia.com/gpu: "8"
            ports:
              - containerPort: 40000
            readinessProbe:
              tcpSocket:
                port: 40000
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - name: model
                mountPath: /work/models
              - name: ib
                mountPath: /dev/infiniband
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
          - name: model
            hostPath:
              path: /path/to/models
          - name: ib
            hostPath:
              path: /dev/infiniband
    workerTemplate:
      spec:
        dnsPolicy: ClusterFirstWithHostNet
        hostNetwork: true
        hostIPC: true
        containers:
          - name: sglang-worker
            image: lmsysorg/sglang:latest
            securityContext:
              privileged: true
            env:
              - name: NCCL_IB_GID_INDEX
                value: "3"
            command:
              - python3
              - -m
              - sglang.launch_server
              - --model-path
              - /work/models
              - --tp
              - "16"
              - --dist-init-addr
              - $(LWS_LEADER_ADDRESS):20000
              - --nnodes
              - $(LWS_GROUP_SIZE)
              - --node-rank
              - $(LWS_WORKER_INDEX)
              - --trust-remote-code
            resources:
              limits:
                nvidia.com/gpu: "8"
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - name: model
                mountPath: /work/models
              - name: ib
                mountPath: /dev/infiniband
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
          - name: ib
            hostPath:
              path: /dev/infiniband
          - name: model
            hostPath:
              path: /path/to/models
---
apiVersion: v1
kind: Service
metadata:
  name: sglang-leader
spec:
  selector:
    leaderworkerset.sigs.k8s.io/name: sglang
    role: leader
  ports:
    - protocol: TCP
      port: 40000
      targetPort: 40000
Deploy with LWS
kubectl apply -f sglang-lws.yaml
Monitor LWS Deployment
# Check LeaderWorkerSet status
kubectl get lws sglang
# View all pods
kubectl get pods -l leaderworkerset.sigs.k8s.io/name=sglang
# Check leader logs
kubectl logs -f sglang-0
# Check worker logs
kubectl logs -f sglang-0-1
RDMA/InfiniBand Configuration
For high-performance multi-node setups with RDMA:
Prerequisites
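- Verify that InfiniBand devices are visible on the nodes, for example with ibv_devices or ibv_devinfo (see RDMA Issues under Troubleshooting).
- Check that the RDMA links are accessible and active, for example with rdma link show.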
RDMA-Enabled Deployment
spec:
  template:
    spec:
      hostNetwork: true
      hostIPC: true
      containers:
        - name: sglang
          securityContext:
            privileged: true
            capabilities:
              add:
                - IPC_LOCK
          env:
            - name: NCCL_IB_GID_INDEX
              value: "3"
            - name: NCCL_IB_QPS_PER_CONNECTION
              value: "8"
            - name: NCCL_IB_SPLIT_DATA_ON_QPS
              value: "1"
            - name: NCCL_NET_PLUGIN
              value: "none"
            - name: NCCL_IB_HCA
              value: "^=mlx5_0,mlx5_5,mlx5_6"
            - name: NCCL_DEBUG
              value: "INFO"
          volumeMounts:
            - name: ib
              mountPath: /dev/infiniband
      volumes:
        - name: ib
          hostPath:
            path: /dev/infiniband
Storage Configuration
Persistent Volume for Model Cache
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany # Required for multi-pod access
  resources:
    requests:
      storage: 100Gi
  storageClassName: nfs-storage # Use appropriate storage class
Using hostPath (Development)
volumes:
  - name: model-cache
    hostPath:
      path: /data/models
      type: DirectoryOrCreate
Using NFS (Production)
volumes:
  - name: model-cache
    nfs:
      server: nfs-server.example.com
      path: /exports/models
Resource Management
Resource Requests and Limits
resources:
  requests:
    cpu: "4"
    memory: "32Gi"
    nvidia.com/gpu: "1"
  limits:
    cpu: "8"
    memory: "64Gi"
    nvidia.com/gpu: "1"
Node Selection
nodeSelector:
  gpu-type: nvidia-a100
  node-role: inference
Tolerations
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
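The nodeSelector and tolerations fragments above are pod-level fields, while resources belongs to each container. As a rough sketch of how the three pieces fit together in a Deployment pod template, reusing the illustrative label and taint values from this section (your node labels and taints may differ):

```yaml
# Fragment of a Deployment showing where each scheduling field lives.
# The label "gpu-type: nvidia-a100" and the taint key are the example values used above.
spec:
  template:
    spec:
      nodeSelector: # pod-level: only schedule on nodes carrying this label
        gpu-type: nvidia-a100
      tolerations: # pod-level: allow scheduling onto tainted GPU nodes
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      containers:
        - name: sglang
          resources: # container-level: GPU request and limit must be equal
            requests:
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"
```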
Monitoring and Observability
Enable Metrics
Enable the metrics endpoint in your deployment:
args:
  - --enable-metrics
  - --metrics-port
  - "8080"
Prometheus Integration
apiVersion: v1
kind: Service
metadata:
  name: sglang-metrics
  labels:
    app: sglang
spec:
  selector:
    app: sglang # must match the labels on your SGLang pods
  ports:
    - name: metrics
      port: 8080
      targetPort: 8080
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sglang-metrics
spec:
  selector:
    matchLabels:
      app: sglang
  endpoints:
    - port: metrics
      interval: 30s
Scaling
Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sglang-deployment # replace with the name of your Deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
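Because an SGLang pod can take minutes to load a model (the probes earlier in this guide wait 120 seconds before the first check), rapid scale-down followed by scale-up can be costly. The autoscaling/v2 API has a behavior field for damping this. The fragment below is a sketch only; the window lengths are illustrative assumptions, not tuned recommendations.

```yaml
# Fragment of the HorizontalPodAutoscaler spec.
# The window values are assumptions chosen for illustration.
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600 # wait 10 minutes of low load before removing replicas
    scaleUp:
      stabilizationWindowSeconds: 0 # add replicas as soon as the target is exceeded
```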
Troubleshooting
Pod Stuck in Pending
# Check events
kubectl describe pod <pod-name>
# Check GPU availability
kubectl describe nodes | grep -A 5 "Allocated resources"
NCCL Communication Failures
# Enable NCCL debug logs
env:
  - name: NCCL_DEBUG
    value: "TRACE"
# Check network connectivity between pods
kubectl exec -it <pod-name> -- ping <other-pod-ip>
RDMA Issues
# Verify RDMA devices in container
kubectl exec -it <pod-name> -- ibv_devices
kubectl exec -it <pod-name> -- ibv_devinfo
# Check RDMA link status
kubectl exec -it <pod-name> -- rdma link show
Out of Memory
# Increase shared memory
volumes:
  - name: dshm
    emptyDir:
      medium: Memory
      sizeLimit: 20Gi # Increase as needed
Best Practices
- Use StatefulSet or LeaderWorkerSet for multi-node: Both give pods stable network identities
- Enable hostNetwork for RDMA: Required for high-performance inter-node communication
- Set privileged mode for InfiniBand: Necessary for RDMA device access
- Use ReadWriteMany PVCs: Enable model sharing across pods
- Configure health probes: Implement both liveness and readiness probes
- Set resource limits: Prevent resource contention
- Use specific image tags: Avoid latest in production
- Monitor NCCL environment: Tune based on network topology
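As an illustration of the image-tag practice above, a container spec might pin the image as follows. This is a sketch; <version> is a placeholder for whichever released tag you have validated, not a specific recommendation.

```yaml
# Fragment of a container spec with a pinned image.
# "<version>" is a placeholder; substitute a tag you have tested.
containers:
  - name: sglang
    image: lmsysorg/sglang:<version>
    imagePullPolicy: IfNotPresent # a pinned tag does not need to be re-pulled on every start
```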
Next Steps