Overview

SGLang can be deployed on the major cloud platforms using their managed Kubernetes, GPU, and TPU services. This guide covers platform-specific configuration and best practices for AWS, GCP, Azure, and several other providers.

Amazon Web Services (AWS)

AWS SageMaker

AWS SageMaker provides managed inference with built-in SGLang container support.

Prerequisites

  • AWS account with SageMaker access
  • IAM role with SageMaker permissions
  • AWS CLI configured
  • SGLang container on Amazon ECR

Build and Push Container

# Set AWS configuration
export AWS_ACCOUNT="<YOUR_AWS_ACCOUNT>"
export AWS_REGION="<YOUR_AWS_REGION>"
export ECR_REGISTRY="${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com"

# Build SageMaker container
docker build -f docker/sagemaker.Dockerfile -t sglang-sagemaker .

# Tag for ECR
docker tag sglang-sagemaker:latest ${ECR_REGISTRY}/sglang-sagemaker:latest

# Login to ECR
aws ecr get-login-password --region ${AWS_REGION} | \
  docker login --username AWS --password-stdin ${ECR_REGISTRY}

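# Create the repository only if it doesn't already exist
aws ecr describe-repositories --repository-names sglang-sagemaker --region ${AWS_REGION} >/dev/null 2>&1 || \
  aws ecr create-repository --repository-name sglang-sagemaker --region ${AWS_REGION}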

# Push image
docker push ${ECR_REGISTRY}/sglang-sagemaker:latest

Deploy Model Endpoint

Use the SageMaker Python SDK:
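import os

import sagemaker
from sagemaker import Model

# Reuse the registry exported in the build step; shell variables are only
# visible to Python through the environment
ECR_REGISTRY = os.environ["ECR_REGISTRY"]

# Initialize SageMaker session
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Define model
model = Model(
    image_uri=f"{ECR_REGISTRY}/sglang-sagemaker:latest",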
    role=role,
    sagemaker_session=sagemaker_session,
    env={
        "SM_SGLANG_MODEL_PATH": "meta-llama/Llama-3.1-8B-Instruct",
        "SM_SGLANG_HOST": "0.0.0.0",
        "SM_SGLANG_PORT": "8080",
        "SM_SGLANG_TP": "1",
        "HF_TOKEN": "<your_hf_token>"
    }
)

# Deploy endpoint
predictor = model.deploy(
    instance_type="ml.g5.xlarge",
    initial_instance_count=1,
    endpoint_name="sglang-llama-endpoint"
)

SageMaker Environment Variables

The SageMaker container uses environment variables with the SM_SGLANG_ prefix:
SM_SGLANG_MODEL_PATH=/opt/ml/model  # Default model path
SM_SGLANG_HOST=0.0.0.0              # Server host
SM_SGLANG_PORT=8080                 # Server port
SM_SGLANG_TP=1                      # Tensor parallelism
Any SGLang launch argument can be set the same way: drop the leading dashes, uppercase the name, replace hyphens with underscores, and add the prefix:
SM_SGLANG_<ARGUMENT_NAME>=<value>
# Example: --max-running-requests becomes
SM_SGLANG_MAX_RUNNING_REQUESTS=32
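
The naming rule can be expressed as a small helper. This is a minimal sketch based only on the `--max-running-requests` example above; the function name `to_sagemaker_env` is hypothetical and is not part of SGLang or the SageMaker SDK.

```python
def to_sagemaker_env(launch_args):
    """Map SGLang launch arguments to SM_SGLANG_-prefixed environment variables.

    Keys may be written with or without leading dashes,
    e.g. "--max-running-requests" or "max-running-requests".
    """
    env = {}
    for name, value in launch_args.items():
        # Strip leading dashes, turn hyphens into underscores, and upper-case.
        key = name.lstrip("-").replace("-", "_").upper()
        # Environment variable values must be strings.
        env[f"SM_SGLANG_{key}"] = str(value)
    return env


env = to_sagemaker_env({"--max-running-requests": 32, "tp": 1})
print(env)  # {'SM_SGLANG_MAX_RUNNING_REQUESTS': '32', 'SM_SGLANG_TP': '1'}
```

The resulting dictionary can be merged into the `env` argument of `Model(...)`.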

Query SageMaker Endpoint

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="sglang-llama-endpoint",
    ContentType="application/json",
    Body=json.dumps({
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "max_tokens": 100
    })
)

result = json.loads(response["Body"].read())
print(result)
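
To pull just the generated text out of the response, a helper along these lines may be useful. It is a sketch that assumes the endpoint returns an OpenAI-style chat completion body (`choices[0].message.content`), which the `messages` request format suggests but this guide does not confirm; `extract_reply` is a hypothetical name and the sample body is hand-written.

```python
import json


def extract_reply(raw_body):
    """Return the assistant text from a chat-completion response body.

    Assumes the OpenAI-style layout: choices[0].message.content.
    """
    payload = json.loads(raw_body)
    choices = payload.get("choices") or []
    if not choices:
        raise ValueError(f"no choices in response: {payload}")
    return choices[0]["message"]["content"]


# Hand-written sample body for illustration, not real endpoint output.
sample = json.dumps({
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "Paris."}}
    ]
})
print(extract_reply(sample))  # Paris.
```

With a live endpoint, the same function can be called as `extract_reply(response["Body"].read())`.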

AWS Deep Learning Containers

AWS maintains official SGLang containers with security patches:
# Check available images
aws ecr describe-images \
  --repository-name sglang \
  --registry-id <AWS_DLC_ACCOUNT> \
  --region ${AWS_REGION}
See AWS SGLang DLCs for the latest images.

Amazon EKS

Deploy SGLang on Elastic Kubernetes Service:

Create EKS Cluster

# Install eksctl (the project is now hosted under the eksctl-io GitHub organization)
curl --silent --location "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin

# Create cluster with GPU nodes
eksctl create cluster \
  --name sglang-cluster \
  --region us-west-2 \
  --nodegroup-name gpu-nodes \
  --node-type p3.2xlarge \
  --nodes 2 \
  --nodes-min 1 \
  --nodes-max 4

Install NVIDIA Device Plugin

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

Deploy SGLang

Follow the Kubernetes deployment guide with EKS-specific configurations:
apiVersion: v1
kind: Service
metadata:
  name: sglang-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
spec:
  type: LoadBalancer
  selector:
    app: sglang
  ports:
    - port: 80
      targetPort: 30000

AWS EC2

Direct deployment on EC2 GPU instances:

Launch GPU Instance

# Launch a p3.2xlarge instance with an NVIDIA Deep Learning AMI.
# AMI IDs are region-specific: replace the ID below with the one for your region.
aws ec2 run-instances \
  --image-id ami-0c55b159cbfafe1f0 \
  --instance-type p3.2xlarge \
  --key-name your-key-pair \
  --security-groups sglang-sg \
  --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=200}'

Install and Run SGLang

# SSH into instance
ssh -i your-key.pem ubuntu@<instance-ip>

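# Pull and run the Docker container.
# Llama 3.1 is a gated model, so a Hugging Face token is passed in;
# --shm-size raises the container's shared memory limit.
docker pull lmsysorg/sglang:latest
docker run -d --gpus all --shm-size 32g -p 30000:30000 \
  -e HF_TOKEN="<your_hf_token>" \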
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 --port 30000
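
Once the server is running, it can be queried over HTTP. The sketch below uses only the standard library and targets SGLang's OpenAI-compatible `/v1/chat/completions` route; the helper names and the `http://localhost:30000` address are assumptions for illustration (substitute the instance address, or tunnel the port over SSH).

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=100):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, prompt, timeout_s=60):
    """Send the request and return the decoded JSON response."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)


# Building the request needs no network access; calling chat() does.
req = build_chat_request(
    "http://localhost:30000",
    "meta-llama/Llama-3.1-8B-Instruct",
    "What is the capital of France?",
)
print(req.full_url)  # http://localhost:30000/v1/chat/completions
```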

Google Cloud Platform (GCP)

Google Kubernetes Engine (GKE)

Create GKE Cluster with GPUs

# Create cluster
gcloud container clusters create sglang-cluster \
  --zone us-central1-a \
  --machine-type n1-standard-8 \
  --accelerator type=nvidia-tesla-v100,count=1 \
  --num-nodes 2 \
  --enable-autoscaling \
  --min-nodes 1 \
  --max-nodes 4

# Install NVIDIA driver
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml

Deploy SGLang on GKE

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sglang-gke
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sglang
  template:
    metadata:
      labels:
        app: sglang
    spec:
      containers:
      - name: sglang
        image: lmsysorg/sglang:latest
        command:
        - python3
        - -m
        - sglang.launch_server
        - --model-path
        - meta-llama/Llama-3.1-8B-Instruct
        - --host
        - 0.0.0.0
        - --port
        - "30000"
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: cache
          mountPath: /root/.cache
      volumes:
      - name: cache
        emptyDir: {}
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-v100
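
The model is downloaded and loaded when the pod starts, so the server may take several minutes to accept requests. A client-side wait loop could look like the sketch below. It assumes the server answers `GET /health` with HTTP 200 once it is up and that the port is reachable locally (for example through `kubectl port-forward deploy/sglang-gke 30000:30000`); the function names are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_health_probe(base_url, timeout_s=5):
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        url = f"{base_url.rstrip('/')}/health"
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe, timeout_s=600, interval_s=5,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Example with a stand-in probe that succeeds on the third attempt.
attempts = iter([False, False, True])
print(wait_until_ready(lambda: next(attempts), sleep=lambda _s: None))  # True
```

Against a real deployment, the call would be `wait_until_ready(lambda: http_health_probe("http://localhost:30000"))`.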

Google Cloud TPU

SGLang supports TPU inference through the JAX backend:

Prerequisites

  • TPU v5e, v6e, or v7 instance
  • SGLang-JAX installation

Using SkyPilot

# sky-tpu.yaml
resources:
  cloud: gcp
  accelerators: tpu-v5e-8
  disk_size: 256

setup: |
  pip install "sglang-jax[tpu]"

run: |
  python -m sglang_jax.launch_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000
Deploy with SkyPilot:
# Install SkyPilot
pip install "skypilot-nightly[gcp]"

# Configure GCP access
gcloud auth application-default login

# Launch
sky launch -c tpu-cluster sky-tpu.yaml

# Check status
sky status tpu-cluster

Direct TPU VM Setup

# Create TPU VM (v5e availability varies by zone; check which zones are enabled for your project)
gcloud compute tpus tpu-vm create sglang-tpu \
  --zone=us-west4-a \
  --accelerator-type=v5litepod-8 \
  --version=v2-alpha-tpuv5-lite

# SSH into TPU VM
gcloud compute tpus tpu-vm ssh sglang-tpu --zone=us-west4-a

# Install SGLang-JAX
pip install "sglang-jax[tpu]"

# Launch server
python -m sglang_jax.launch_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 30000

Google Compute Engine

# Create GPU instance
gcloud compute instances create sglang-vm \
  --zone=us-central1-a \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-v100,count=1 \
  --image-family=ubuntu-2204-lts \
  --image-project=ubuntu-os-cloud \
  --boot-disk-size=200GB \
  --maintenance-policy=TERMINATE

# Install NVIDIA drivers
gcloud compute ssh sglang-vm -- 'curl https://raw.githubusercontent.com/GoogleCloudPlatform/compute-gpu-installation/main/linux/install_gpu_driver.py --output install_gpu_driver.py && sudo python3 install_gpu_driver.py'

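# Install Docker and the NVIDIA Container Toolkit, then run SGLang
# (if apt cannot find nvidia-container-toolkit, add NVIDIA's apt repository first)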
gcloud compute ssh sglang-vm -- 'sudo apt-get update && sudo apt-get install -y docker.io nvidia-container-toolkit && sudo docker run -d --gpus all -p 30000:30000 lmsysorg/sglang:latest python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000'

Microsoft Azure

Azure Kubernetes Service (AKS)

Create AKS Cluster

# Create resource group
az group create --name sglang-rg --location eastus

# Create AKS cluster with GPU nodes
az aks create \
  --resource-group sglang-rg \
  --name sglang-aks \
  --node-count 2 \
  --node-vm-size Standard_NC6s_v3 \
  --generate-ssh-keys

# Get credentials
az aks get-credentials --resource-group sglang-rg --name sglang-aks

# Install NVIDIA device plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

Deploy SGLang

Use standard Kubernetes manifests from the Kubernetes guide.

Azure VM

# Create GPU VM
az vm create \
  --resource-group sglang-rg \
  --name sglang-vm \
  --image Ubuntu2204 \
  --size Standard_NC6s_v3 \
  --admin-username azureuser \
  --generate-ssh-keys

# Open port 30000
az vm open-port --port 30000 --resource-group sglang-rg --name sglang-vm

# Install Docker and start SGLang through a run-command (no interactive SSH session needed)
az vm run-command invoke \
  --resource-group sglang-rg \
  --name sglang-vm \
  --command-id RunShellScript \
  --scripts "curl -fsSL https://get.docker.com | sh && sudo apt-get install -y nvidia-container-toolkit && sudo docker run -d --gpus all -p 30000:30000 lmsysorg/sglang:latest python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000"

Azure Container Instances

az container create \
  --resource-group sglang-rg \
  --name sglang-aci \
  --image lmsysorg/sglang:latest \
  --cpu 4 \
  --memory 16 \
  --gpu-count 1 \
  --gpu-sku V100 \
  --ports 30000 \
  --command-line "python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000"

Other Cloud Providers

Oracle Cloud Infrastructure (OCI)

# Launch GPU instance
oci compute instance launch \
  --availability-domain <AD> \
  --compartment-id <COMPARTMENT_ID> \
  --shape VM.GPU3.1 \
  --image-id <UBUNTU_IMAGE_ID> \
  --subnet-id <SUBNET_ID>

Alibaba Cloud

# Create ECS instance with GPU
aliyun ecs CreateInstance \
  --RegionId cn-hangzhou \
  --InstanceType ecs.gn6i-c4g1.xlarge \
  --ImageId ubuntu_22_04_x64

Lambda Labs

Lambda Labs offers on-demand GPU cloud instances:
# Launch instance via Lambda Cloud dashboard
# SSH into instance
ssh ubuntu@<instance-ip>

# Install SGLang
pip install "sglang[all]"

# Launch server
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 30000

Cloud Storage Integration

AWS S3 for Models

# Download the model directory from S3
aws s3 sync s3://my-bucket/models/llama-8b /models/llama-8b

# Launch with local model
python -m sglang.launch_server --model-path /models/llama-8b
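
A model is normally a directory of several files, while boto3's `download_file` copies a single object, so fetching a model from Python means iterating over the keys under its prefix. The following is a sketch under that assumption: the bucket, prefix, and helper names are illustrative, and `boto3` is imported lazily so the path-mapping helper can be tried without AWS credentials.

```python
import os


def local_path_for_key(key, prefix, dest_dir):
    """Map an S3 object key under `prefix` to a file path under `dest_dir`."""
    relative = key[len(prefix):].lstrip("/")
    return os.path.join(dest_dir, *relative.split("/"))


def download_prefix(bucket, prefix, dest_dir):
    """Download every object under `prefix` using boto3 (imported lazily)."""
    import boto3  # requires `pip install boto3` and configured AWS credentials

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                continue  # skip zero-byte "folder" placeholder objects
            target = local_path_for_key(key, prefix, dest_dir)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            s3.download_file(bucket, key, target)


# Use a prefix ending in "/" so sibling prefixes are not matched.
print(local_path_for_key("models/llama-8b/config.json", "models/llama-8b/", "/models/llama-8b"))
```

A full download would then be `download_prefix("my-bucket", "models/llama-8b/", "/models/llama-8b")`.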

Google Cloud Storage

# Download model
gsutil -m cp -r gs://my-bucket/models/llama-8b /models/

# Launch
python -m sglang.launch_server --model-path /models/llama-8b

Azure Blob Storage

# Download model files from a blob container (requires the Azure CLI)
az storage blob download-batch \
  --account-name mystorageaccount \
  --source models \
  --destination /models/

Cost Optimization

Use Spot/Preemptible Instances

AWS Spot Instances:
aws ec2 run-instances \
  --instance-type p3.2xlarge \
  --instance-market-options MarketType=spot
GCP Preemptible VMs:
gcloud compute instances create sglang-vm \
  --preemptible \
  --machine-type n1-standard-8 \
  --accelerator type=nvidia-tesla-v100,count=1
Azure Spot VMs:
az vm create \
  --priority Spot \
  --max-price -1 \
  --size Standard_NC6s_v3

Auto-Scaling

Configure node autoscaling so the cluster scales down during periods of low usage:
# AWS EKS
aws eks update-nodegroup-config \
  --cluster-name sglang-cluster \
  --nodegroup-name gpu-nodes \
  --scaling-config minSize=0,maxSize=4,desiredSize=1

# GKE
gcloud container clusters update sglang-cluster \
  --enable-autoscaling \
  --min-nodes 0 \
  --max-nodes 4

Security Best Practices

Network Security

  1. Use private subnets for compute instances
  2. Implement VPC peering for multi-region deployments
  3. Configure security groups to restrict access:
# AWS security group
aws ec2 create-security-group \
  --group-name sglang-sg \
  --description "SGLang security group"

aws ec2 authorize-security-group-ingress \
  --group-name sglang-sg \
  --protocol tcp \
  --port 30000 \
  --cidr 10.0.0.0/16
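
To see which clients the `10.0.0.0/16` rule admits, the CIDR block can be checked with Python's standard `ipaddress` module. This is an illustrative sketch; the helper name and sample addresses are made up.

```python
import ipaddress

ALLOWED = ipaddress.ip_network("10.0.0.0/16")


def is_allowed(client_ip, network=ALLOWED):
    """Return True if client_ip falls inside the allowed CIDR block."""
    return ipaddress.ip_address(client_ip) in network


print(ALLOWED.num_addresses)    # 65536 (10.0.0.0 through 10.0.255.255)
print(is_allowed("10.0.42.7"))  # True
print(is_allowed("10.1.0.1"))   # False
```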

Secrets Management

AWS Secrets Manager:
import boto3
client = boto3.client('secretsmanager')
response = client.get_secret_value(SecretId='hf-token')
token = response['SecretString']
GCP Secret Manager:
echo -n "your-token" | gcloud secrets create hf-token --data-file=-
kubectl create secret generic hf-token \
  --from-literal=token="$(gcloud secrets versions access latest --secret=hf-token)"
Azure Key Vault:
az keyvault secret set \
  --vault-name sglang-vault \
  --name hf-token \
  --value "your-token"
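
In AWS Secrets Manager, `SecretString` can hold either plain text or a JSON object of key/value pairs, depending on how the secret was created. A small helper can handle both cases. This is a sketch; the function name and the `HF_TOKEN` field name are assumptions, so use whatever key the secret was stored under.

```python
import json


def token_from_secret(secret_string, key="HF_TOKEN"):
    """Return the token from a SecretString that is plain text or a JSON object.

    `key` is a hypothetical field name used only when the secret is a JSON object.
    """
    try:
        parsed = json.loads(secret_string)
    except json.JSONDecodeError:
        return secret_string.strip()  # plain-text secret
    if isinstance(parsed, dict):
        return parsed[key]
    return secret_string.strip()  # valid JSON scalar: treat as plain text


print(token_from_secret('{"HF_TOKEN": "example-token"}'))  # example-token
print(token_from_secret("example-token\n"))                # example-token
```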

Monitoring and Logging

AWS CloudWatch

import boto3
cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_data(
    Namespace='SGLang',
    MetricData=[
        {
            'MetricName': 'Requests',
            'Value': 100,
            'Unit': 'Count'
        },
    ]
)
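
When several counters are published together, the `MetricData` list can be built from a dictionary. The sketch below only constructs the payload; the helper name and the `Errors` metric are hypothetical.

```python
def build_metric_data(counters, unit="Count"):
    """Turn {"MetricName": value} pairs into CloudWatch MetricData entries."""
    return [
        {"MetricName": name, "Value": float(value), "Unit": unit}
        for name, value in sorted(counters.items())
    ]


metric_data = build_metric_data({"Requests": 100, "Errors": 2})
print(metric_data)
```

The list can then be passed as `cloudwatch.put_metric_data(Namespace='SGLang', MetricData=metric_data)`.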

GCP Cloud Logging

gcloud logging read "resource.type=k8s_container AND resource.labels.container_name=sglang" \
  --limit 50 \
  --format json
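
The JSON output is a list of log entries, each with a `timestamp` and either a `textPayload` or a `jsonPayload`. A sketch of reducing that output to one line per entry follows; the function name is hypothetical and the sample entries are hand-written, not real log output.

```python
import json


def summarize_entries(raw_json):
    """Return (timestamp, severity, message) tuples from the JSON log output."""
    rows = []
    for entry in json.loads(raw_json):
        # Container logs carry plain text in textPayload
        # or structured data in jsonPayload.
        if "textPayload" in entry:
            message = entry["textPayload"].rstrip()
        else:
            message = json.dumps(entry.get("jsonPayload", {}), sort_keys=True)
        rows.append((entry.get("timestamp", ""), entry.get("severity", "DEFAULT"), message))
    return rows


# Hand-written sample for illustration, not real log output.
sample = json.dumps([
    {"timestamp": "2024-01-01T00:00:00Z", "severity": "INFO", "textPayload": "server started\n"},
    {"timestamp": "2024-01-01T00:00:05Z", "jsonPayload": {"event": "request"}},
])
for row in summarize_entries(sample):
    print(row)
```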

Azure Monitor

az monitor metrics list \
  --resource /subscriptions/<sub-id>/resourceGroups/sglang-rg/providers/Microsoft.ContainerService/managedClusters/sglang-aks \
  --metric node_cpu_usage_percentage

Next Steps