This guide helps you migrate between major versions of SGLang and understand breaking changes.

Overview

SGLang follows semantic versioning (MAJOR.MINOR.PATCH):
  • Major versions: Breaking changes that require code modifications
  • Minor versions: New features with backward compatibility
  • Patch versions: Bug fixes with backward compatibility

Migrating to v0.5.x

Environment Variables

Several environment variables have been deprecated in favor of CLI flags. These environment variables will be removed in v0.5.7+, so migrate to the CLI flags listed below:

Deprecated Env Var | Replacement CLI Flag
SGLANG_ENABLE_FLASHINFER_FP8_GEMM | --fp8-gemm-backend=flashinfer_trtllm
SGLANG_ENABLE_FLASHINFER_GEMM | --fp8-gemm-backend=flashinfer_trtllm
SGLANG_SUPPORT_CUTLASS_BLOCK_FP8 | --fp8-gemm-backend=cutlass
SGLANG_FLASHINFER_FP4_GEMM_BACKEND | --fp4-gemm-backend
SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE | --enable-prefill-delayer
SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES | --prefill-delayer-max-delay-passes
SGLANG_PREFILL_DELAYER_TOKEN_USAGE_LOW_WATERMARK | --prefill-delayer-token-usage-low-watermark

Before:
export SGLANG_ENABLE_FLASHINFER_FP8_GEMM=true
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct
After:
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --fp8-gemm-backend flashinfer_trtllm
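
The table above can be applied mechanically when a launch script still sets the old variables. The following is a minimal sketch, not part of SGLang: the helper name `deprecated_env_to_cli_args` is hypothetical, and it assumes that truthy values are spelled `1`, `true`, `yes`, or `on` and that value-carrying variables pass their value through to the flag unchanged. Verify both assumptions against your SGLang version.

```python
# Hypothetical helper (not part of SGLang): translate the deprecated
# environment variables from the table above into CLI arguments.

# Variables that select a fixed backend value when set to a truthy value.
FIXED_VALUE_FLAGS = {
    "SGLANG_ENABLE_FLASHINFER_FP8_GEMM": ("--fp8-gemm-backend", "flashinfer_trtllm"),
    "SGLANG_ENABLE_FLASHINFER_GEMM": ("--fp8-gemm-backend", "flashinfer_trtllm"),
    "SGLANG_SUPPORT_CUTLASS_BLOCK_FP8": ("--fp8-gemm-backend", "cutlass"),
}
# Variables that become a bare boolean switch.
SWITCH_FLAGS = {
    "SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE": "--enable-prefill-delayer",
}
# Variables whose value is ASSUMED to carry over unchanged to the flag.
VALUE_FLAGS = {
    "SGLANG_FLASHINFER_FP4_GEMM_BACKEND": "--fp4-gemm-backend",
    "SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES": "--prefill-delayer-max-delay-passes",
    "SGLANG_PREFILL_DELAYER_TOKEN_USAGE_LOW_WATERMARK": "--prefill-delayer-token-usage-low-watermark",
}

# ASSUMPTION: these spellings count as "enabled".
TRUTHY = {"1", "true", "yes", "on"}


def deprecated_env_to_cli_args(env):
    """Return CLI arguments equivalent to the deprecated variables set in env."""
    args = []
    for name, (flag, value) in FIXED_VALUE_FLAGS.items():
        # If several variables map to the same flag, the first one listed wins.
        if env.get(name, "").strip().lower() in TRUTHY and flag not in args:
            args += [flag, value]
    for name, flag in SWITCH_FLAGS.items():
        if env.get(name, "").strip().lower() in TRUTHY:
            args.append(flag)
    for name, flag in VALUE_FLAGS.items():
        if env.get(name):
            args += [flag, env[name]]
    return args
```

Calling `deprecated_env_to_cli_args(os.environ)` in a wrapper script yields a list that can be appended to the `python -m sglang.launch_server` command line.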

Timeout Configuration

Timeout environment variables have been renamed, and their values are now given in seconds instead of milliseconds:

Old (milliseconds) | New (seconds)
SGLANG_QUEUED_TIMEOUT_MS | SGLANG_REQ_WAITING_TIMEOUT
SGLANG_FORWARD_TIMEOUT_MS | SGLANG_REQ_RUNNING_TIMEOUT

Before:
export SGLANG_QUEUED_TIMEOUT_MS=300000  # 5 minutes in ms
After:
export SGLANG_REQ_WAITING_TIMEOUT=300  # 5 minutes in seconds
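
Because both the name and the unit change, an old value copied as-is becomes a timeout 1000 times longer than intended. The sketch below converts the old settings. The function names are hypothetical, and it assumes the new variables accept fractional seconds, which you should confirm for your version.

```python
# Hypothetical helper (not part of SGLang): rename the timeout variables
# and convert their values from milliseconds to seconds.
RENAMED_TIMEOUTS = {
    "SGLANG_QUEUED_TIMEOUT_MS": "SGLANG_REQ_WAITING_TIMEOUT",
    "SGLANG_FORWARD_TIMEOUT_MS": "SGLANG_REQ_RUNNING_TIMEOUT",
}


def ms_to_seconds(value_ms):
    """Convert a millisecond value (string or number) to a seconds string."""
    seconds = float(value_ms) / 1000
    # Print whole seconds without a trailing ".0".
    return str(int(seconds)) if seconds.is_integer() else str(seconds)


def migrate_timeouts(env):
    """Return the new variables implied by old ones that are set in env."""
    migrated = {}
    for old_name, new_name in RENAMED_TIMEOUTS.items():
        # Never override a new-style variable that is already set.
        if old_name in env and new_name not in env:
            migrated[new_name] = ms_to_seconds(env[old_name])
    return migrated
```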

Prefix Migration: SGL_ to SGLANG_

All SGL_ prefixed environment variables are deprecated in favor of SGLANG_:
Before:
export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=true
After:
export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=false
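The old SGL_ prefix still works but emits a deprecation warning. Note that in this example the variable name also changes from DISABLE to ENABLE, so the value is inverted as well.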
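
To find leftover variables before upgrading, a script can scan the environment for the old prefix. This is a minimal sketch with a hypothetical function name. The suggested replacement is only a prefix swap, so treat it as a candidate: as the example above shows, some variables were renamed beyond the prefix and need a manual check.

```python
# Hypothetical helper (not part of SGLang): list SGL_-prefixed variables
# together with a candidate SGLANG_ name obtained by swapping the prefix.
OLD_PREFIX = "SGL_"
NEW_PREFIX = "SGLANG_"


def find_legacy_sgl_vars(env):
    """Return {old_name: candidate_new_name} for every SGL_ variable in env."""
    return {
        name: NEW_PREFIX + name[len(OLD_PREFIX):]
        for name in sorted(env)
        # "SGLANG_..." does not start with "SGL_", so new names are skipped.
        if name.startswith(OLD_PREFIX)
    }
```

Running `find_legacy_sgl_vars(os.environ)` at the top of a launch script gives a quick inventory of what still needs renaming.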

Migrating to v0.4.x

Deterministic Inference

A new deterministic inference mode was introduced. If you need reproducible results:
Before (v0.3.x):
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disable-radix-cache
After (v0.4.x):
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --enable-deterministic-inference
See the blog post for details.

MoE Backend Changes

The SGLANG_CUTLASS_MOE environment variable is deprecated:
Before:
export SGLANG_CUTLASS_MOE=true
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3
After:
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --moe-runner-backend cutlass

Migrating from Other Frameworks

From vLLM

SGLang provides a similar API to vLLM with enhanced performance:
vLLM:
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
outputs = llm.generate(
    ["Tell me a joke"],
    SamplingParams(temperature=0.7, max_tokens=100)
)
SGLang:
import sglang as sgl

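llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")
# Engine.generate takes sampling parameters as a plain dict;
# the length limit is named max_new_tokens rather than max_tokens.
outputs = llm.generate(
    ["Tell me a joke"],
    {"temperature": 0.7, "max_new_tokens": 100},
)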

Key Differences from vLLM

  1. Prefix Caching: SGLang uses RadixAttention by default (more efficient)
  2. Chunked Prefill: Different default chunk sizes
  3. Memory Management: Different memory fraction defaults
  4. API Compatibility: SGLang is OpenAI-compatible but has additional features
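
Because the server is OpenAI-compatible, an existing client that speaks the chat completions protocol can usually be pointed at it by changing only the base URL. The sketch below builds such a request with the standard library. The function names are hypothetical, and the base URL `http://localhost:30000/v1` is the one used in the LiteLLM example in this guide, so adjust it to your deployment.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=100):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request to a running server and return the parsed JSON body."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Usage, assuming a server is listening on the URL below:
#   request = build_chat_request(
#       "http://localhost:30000/v1",
#       "meta-llama/Llama-3.1-8B-Instruct",
#       "Tell me a joke",
#   )
#   reply = send(request)
```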

From Text Generation Inference (TGI)

TGI uses a Docker-based approach, while SGLang can run directly:
TGI:
docker run --gpus all \
  -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct
SGLang:
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 8080

From LiteLLM

LiteLLM is a proxy/router, while SGLang is an inference engine. You can use LiteLLM with SGLang:
import litellm

# Point LiteLLM to SGLang endpoint
response = litellm.completion(
    model="openai/meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    api_base="http://localhost:30000/v1"
)

Breaking Changes by Version

v0.5.0

  • Environment variable prefix changes (SGL_ → SGLANG_)
  • Timeout units changed from milliseconds to seconds
  • Several FP8/quantization env vars deprecated for CLI flags
  • Memory pool configuration changes

v0.4.0

  • Introduction of deterministic inference mode
  • MoE backend configuration moved to CLI flags
  • FlashInfer becomes the default attention backend
  • Changes to RadixAttention cache behavior

v0.3.0

  • Initial support for DeepSeek V3
  • New multi-node deployment options
  • Changes to expert parallelism configuration

Best Practices for Migration

1. Test in Staging First

Always test new versions in a staging environment before production deployment.

2. Review Deprecation Warnings

Pay attention to deprecation warnings in logs:
python -m sglang.launch_server --model-path YOUR_MODEL 2>&1 | grep -i "deprecat"
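
When logs are collected to a file or a logging system, the same filter can be applied in a script. This is a minimal sketch with a hypothetical function name. It matches the substring "deprecat" case-insensitively, like the grep command above, and additionally drops repeated lines. It makes no assumption about the exact wording of SGLang's warnings.

```python
import re

# Same pattern as `grep -i "deprecat"`.
DEPRECATION_PATTERN = re.compile(r"deprecat", re.IGNORECASE)


def deprecation_lines(log_text):
    """Return the unique log lines that mention a deprecation, in order."""
    seen = set()
    result = []
    for line in log_text.splitlines():
        line = line.strip()
        if DEPRECATION_PATTERN.search(line) and line not in seen:
            seen.add(line)
            result.append(line)
    return result
```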

3. Pin Versions in Production

Use specific versions in your requirements:
sglang==0.5.6  # Not sglang>=0.5.0
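
A CI check can catch an accidentally loosened pin. The sketch below is a hypothetical helper that flags any `sglang` requirement line not using an exact `==` pin. It handles only simple requirement lines (optional extras, no environment markers or URLs), which is an assumption about how your requirements file is written.

```python
import re

# Package name, optional extras such as [all], then the version specifier.
REQUIREMENT = re.compile(r"^(?P<name>[A-Za-z0-9_.-]+)(?:\[[^\]]*\])?(?P<spec>.*)$")
# An exact pin: "==" followed by a concrete version (no "*" wildcard).
EXACT_PIN = re.compile(r"^==[0-9A-Za-z.!+]+$")


def unpinned_sglang_lines(requirements_text):
    """Return sglang requirement lines that are not pinned with '=='."""
    offenders = []
    for raw in requirements_text.splitlines():
        # Drop trailing comments and whitespace before parsing.
        line = raw.split("#", 1)[0].replace(" ", "").strip()
        match = REQUIREMENT.match(line)
        if (
            match
            and match.group("name").lower() == "sglang"
            and not EXACT_PIN.match(match.group("spec"))
        ):
            offenders.append(raw.strip())
    return offenders
```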

4. Check Release Notes

Always review release notes before upgrading.

5. Update Configuration Files

If you use configuration files, update them according to the new format:
# config.py - Before
config = {
    "env": {
        "SGLANG_ENABLE_FLASHINFER_FP8_GEMM": "true"
    }
}

# config.py - After
config = {
    "args": [
        "--fp8-gemm-backend", "flashinfer_trtllm"
    ]
}

6. Monitor Performance

After migration, monitor key metrics:
  • Throughput (requests/second)
  • Latency (p50, p95, p99)
  • GPU memory usage
  • Error rates
See Observability for monitoring setup.
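
For a quick before-and-after comparison without a full monitoring stack, the latency percentiles and throughput listed above can be computed from per-request measurements. This is a minimal sketch using only the standard library. The function names are hypothetical, and it assumes you have already collected request latencies in milliseconds.

```python
import statistics


def latency_percentiles(latencies_ms):
    """Return p50, p95, and p99 for a list of at least two latencies."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def throughput(num_requests, window_seconds):
    """Return requests per second over the measurement window."""
    return num_requests / window_seconds
```

Comparing these numbers for the same workload on the old and new versions makes regressions visible before they reach production.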

Backward Compatibility

SGLang maintains backward compatibility across patch releases within a minor version, and keeps deprecated options working across adjacent minor versions:
  • 0.5.0 → 0.5.6: Fully compatible
  • 0.4.x → 0.5.x: Deprecation warnings, but works
  • 0.3.x → 0.5.x: May require configuration updates

Getting Help with Migration

If you encounter issues during migration:
  1. Check migration issues: Search GitHub Issues with label migration
  2. Ask in Slack: Join https://slack.sglang.io/ and ask in #general or #help
  3. Consult documentation: Check version-specific docs
  4. Report problems: File an issue with your migration scenario
