Documentation Index
Fetch the complete documentation index at: https://mintlify.com/sgl-project/sglang/llms.txt
Use this file to discover all available pages before exploring further.
Meta’s Llama series represents one of the most widely-used families of open-source large language models, ranging from 7B to 400B parameters across Llama 2, Llama 3, and Llama 4 generations.
Overview
Llama 4 is Meta’s latest generation with industry-leading performance. SGLang has provided first-class support and optimizations for Llama models since v0.4.5.
Supported Llama Models
- Llama 4 Scout (109B) - Latest generation
- Llama 4 Maverick (400B) - Largest Llama model
- Llama 3.x series (1B, 3B, 8B, 70B) - Previous generation
- Llama 2 series (7B, 13B, 70B) - Foundation models
- Llama Vision (11B, 90B) - Multimodal variants
- Specialized variants: Classification, Embedding, Reward models
Quick Start
Basic Launch Command
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.2-1B-Instruct \
--host 0.0.0.0 \
--port 30000
Llama 4 Launch (8xH100/H200)
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
--tp 8 \
--context-length 1000000
Llama 4 Configuration
Hardware Recommendations
| Model | Hardware | Context Length | Notes |
|---|
| Scout (109B) | 8×H100 | Up to 1M | Adjust --context-length to avoid OOM |
| Scout (109B) | 8×H200 | Up to 2.5M | Extended context support |
| Scout (109B) + Hybrid KV | 8×H100 | Up to 5M | With --swa-full-tokens-ratio |
| Scout (109B) + Hybrid KV | 8×H200 | Up to 10M | Maximum supported context |
| Maverick (400B) | 8×H200 | Up to 1M | Full precision |
| Maverick (400B) | 8×B200 | - | Optimal performance |
Configuration Tips
Attention Backend Auto-Selection
SGLang automatically selects the optimal attention backend based on your hardware:
- Blackwell GPUs (B200/GB200):
trtllm_mha
- Hopper GPUs (H100/H200):
fa3 (FlashAttention 3)
- AMD GPUs:
aiter
- Intel XPU:
intel_xpu
- Other platforms:
triton (fallback)
To override auto-selection:
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
--tp 8 \
--attention-backend fa3
Context Length Management
Adjust --context-length to avoid GPU out-of-memory issues:
# Scout on 8×H100 - up to 1M tokens
--context-length 1000000
# Scout on 8×H200 - up to 2.5M tokens
--context-length 2500000
Hybrid KV Cache
Enable hybrid KV cache for extended context lengths using Llama 4’s local attention layers:
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
--tp 8 \
--context-length 5000000 \
--swa-full-tokens-ratio 0.8 # Ratio of SWA layer KV tokens (default: 0.8, range: 0-1)
Chat Template
For chat completion tasks, add the Llama 4 chat template:
Multimodal Support
For Llama Vision models:
EAGLE Speculative Decoding
Llama 4 Maverick (400B) supports EAGLE speculative decoding for accelerated inference.
Launch with EAGLE
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path nvidia/Llama-4-Maverick-17B-128E-Eagle3 \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--trust-remote-code \
--tp 8 \
--context-length 1000000
Note: The Llama 4 EAGLE draft model (nvidia/Llama-4-Maverick-17B-128E-Eagle3) only recognizes conversations in chat mode.
Benchmarks
Accuracy Test (MMLU Pro)
SGLang achieves accuracy matching or exceeding official benchmarks:
| Model | Official Benchmark | SGLang | Hardware |
|---|
| Llama-4-Scout-17B-16E-Instruct | 74.3 | 75.2 | 8×H100 |
| Llama-4-Maverick-17B-128E-Instruct | 80.5 | 80.7 | 8×H100 |
Running Accuracy Tests
Llama-4-Scout
# Launch server
python -m sglang.launch_server \
--model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
--port 30000 \
--tp 8 \
--mem-fraction-static 0.8 \
--context-length 65536
# Run lm_eval
lm_eval --model local-chat-completions \
--model_args model=meta-llama/Llama-4-Scout-17B-16E-Instruct,base_url=http://localhost:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 \
--tasks mmlu_pro \
--batch_size 128 \
--apply_chat_template \
--num_fewshot 0
Llama-4-Maverick
# Launch server
python -m sglang.launch_server \
--model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \
--port 30000 \
--tp 8 \
--mem-fraction-static 0.8 \
--context-length 65536
# Run lm_eval
lm_eval --model local-chat-completions \
--model_args model=meta-llama/Llama-4-Maverick-17B-128E-Instruct,base_url=http://localhost:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 \
--tasks mmlu_pro \
--batch_size 128 \
--apply_chat_template \
--num_fewshot 0
Llama 3.x Models
Llama 3.x models (1B, 3B, 8B, 70B) are also fully supported:
# Llama 3.2 1B (lightweight)
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.2-1B-Instruct \
--port 30000
# Llama 3.1 8B (popular size)
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--port 30000
# Llama 3.1 70B (multi-GPU)
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-70B-Instruct \
--tp 4 \
--port 30000
Specialized Llama Variants
SGLang supports specialized Llama model variants:
Embedding Models
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.2-1B-Embedding \
--port 30000
Classification Models
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.2-1B-Classification \
--port 30000
Reward Models
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.2-1B-Reward \
--port 30000
Llama Vision (Multimodal)
Llama 3.2 includes vision-enabled variants (11B, 90B). See the Multimodal Models guide for detailed usage.
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
--enable-multimodal \
--port 30000
Advanced Features
EAGLE Decoding for Llama 3
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-70B-Instruct \
--speculative-algorithm EAGLE \
--speculative-draft-model-path <eagle-draft-model> \
--speculative-num-steps 3 \
--tp 4
Quantization
SGLang supports various quantization methods for Llama models:
# FP8 quantization
--quantization fp8
# AWQ quantization
--quantization awq
# GPTQ quantization
--quantization gptq
Resources
Troubleshooting
Out of Memory (OOM)
Reduce --context-length:
--context-length 512000 # Reduce from 1M to 512K
Or reduce memory fraction:
--mem-fraction-static 0.8 # Reduce from default 0.9
Slow Model Loading
Increase timeout:
--watchdog-timeout 1200 # Increase to 20 minutes
Enable parallel weight loading:
--model-loader-extra-config '{"enable_multithread_load": true}'