Overview
Tensor Parallelism (TP) is the most common parallelism strategy for LLM inference, where model weights are distributed across multiple GPUs, typically within a single node. Each GPU holds a portion of each layer’s parameters, enabling models to scale beyond a single GPU’s memory capacity.
How It Works
In tensor parallelism:
- Model weights are sharded across multiple GPUs
- Each GPU computes a portion of each layer’s output
- All-reduce operations synchronize results across GPUs
- All GPUs process the same batch of requests
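The four points above can be illustrated with a toy simulation. This is a minimal sketch, not SGLang's implementation: plain Python lists stand in for GPU tensors, the two-layer product y = (x @ A) @ B stands in for a transformer layer, and the helper names (split_columns, split_rows, all_reduce_sum, tp_forward) are made up for illustration. The first weight is split by columns and the second by rows, so each simulated GPU produces a partial output and a single all-reduce (an element-wise sum) recovers the full result.

```python
# Toy simulation of tensor parallelism for y = (x @ A) @ B.
# Each "rank" is one simulated GPU; no real GPU or communication library is used.

def matmul(x, w):
    """Multiply an (n x k) matrix by a (k x m) matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_columns(w, tp):
    """Column-parallel shard: each rank keeps a contiguous block of columns."""
    step = len(w[0]) // tp
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(tp)]

def split_rows(w, tp):
    """Row-parallel shard: each rank keeps a contiguous block of rows."""
    step = len(w) // tp
    return [w[r * step:(r + 1) * step] for r in range(tp)]

def all_reduce_sum(partials):
    """Element-wise sum of every rank's partial output (what all-reduce computes)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

def tp_forward(x, a, b, tp):
    a_shards = split_columns(a, tp)  # first weight sharded by columns
    b_shards = split_rows(b, tp)     # second weight sharded by rows
    # Every rank receives the same input batch x and computes a partial result.
    partials = [matmul(matmul(x, a_shards[r]), b_shards[r]) for r in range(tp)]
    return all_reduce_sum(partials)

x = [[1, 2], [3, 4]]                  # batch of 2 tokens, hidden size 2
a = [[1, 0, 2, 1], [0, 1, 1, 3]]      # 2 x 4 weight
b = [[1, 2], [0, 1], [2, 0], [1, 1]]  # 4 x 2 weight

reference = matmul(matmul(x, a), b)   # single-device result
sharded = tp_forward(x, a, b, tp=2)   # two simulated GPUs
print(reference == sharded)           # True
```

Each rank only ever touches its own shard of A and B, which is why per-GPU weight memory shrinks as the TP size grows, while the all-reduce step is where interconnect bandwidth matters.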
Key Characteristics
- Best suited for intra-node scaling (GPUs connected via NVLink/PCIe)
- Requires high-bandwidth communication for all-reduce operations
- Works well for models with standard attention mechanisms (GQA, MHA)
- Memory efficient: Each GPU stores only a portion of model weights
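As a rough illustration of the memory point, the sketch below estimates weight memory per GPU. It assumes every weight is sharded evenly across the TP group and ignores the KV cache, activations, and any tensors that are replicated rather than sharded, so treat the numbers as a lower bound. The function name and the 70-billion-parameter, 2-bytes-per-parameter figures are illustrative choices, not values from SGLang.

```python
def weight_bytes_per_gpu(num_params: int, bytes_per_param: int, tp_size: int) -> int:
    """Approximate weight bytes held by each GPU under an even TP split."""
    total = num_params * bytes_per_param
    return total // tp_size

# Hypothetical 70B-parameter model stored in 16-bit precision (2 bytes per parameter).
for tp in (1, 2, 4, 8):
    gb = weight_bytes_per_gpu(70_000_000_000, 2, tp) / 1e9
    print(f"tp={tp}: ~{gb:.1f} GB of weights per GPU")
# tp=1: ~140.0 GB of weights per GPU
# tp=2: ~70.0 GB of weights per GPU
# tp=4: ~35.0 GB of weights per GPU
# tp=8: ~17.5 GB of weights per GPU
```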
When to Use Tensor Parallelism
Use TP when:
- Model doesn’t fit on a single GPU
- You have multiple GPUs in a single node with fast interconnects
- Working with standard attention models (Llama, Qwen, Mistral, etc.)
- You need low latency for small batch sizes
Consider alternatives when:
- Using MLA-based models (DeepSeek, MiniMax) → Use Data Parallelism Attention
- Scaling across multiple nodes → Use Pipeline Parallelism
- Working with MoE models → Combine with Expert Parallelism
Configuration
Basic Setup
Enable tensor parallelism with the --tp flag, which sets the number of GPUs the model is sharded across.
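A minimal sketch of a single-node launch, written as a Python helper that assembles the command line. The helper name tp_launch_args is hypothetical and the model path is only a placeholder; the python -m sglang.launch_server entry point and --model-path flag are assumed to match your installed SGLang version.

```python
import shlex

def tp_launch_args(model_path: str, tp_size: int) -> list[str]:
    """Build the argv for a single-node SGLang server sharded across tp_size GPUs."""
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,   # placeholder model; substitute your own
        "--tp", str(tp_size),         # tensor parallel size
    ]

args = tp_launch_args("meta-llama/Llama-3.1-8B-Instruct", 2)
print(shlex.join(args))
# python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --tp 2
```

On a machine with SGLang installed and at least two GPUs, the printed command can be pasted into a shell, or args can be passed to subprocess.run.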
Multi-Node Tensor Parallelism
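To run TP across multiple nodes, launch the server on each node with the same --nnodes and --dist-init-addr values. If the server deadlocks during startup, try adding --disable-cuda-graph.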
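The sketch below generates one launch command per node. Several details are assumptions rather than facts stated on this page: the --node-rank flag (used here to give each node a distinct index), the interpretation of --tp as the total TP size across all nodes, the placeholder address 10.0.0.1:29500, and the placeholder model path. Check them against the server argument reference for your SGLang version.

```python
import shlex

def multi_node_commands(model_path: str, tp_size: int, nnodes: int, dist_init_addr: str) -> list[str]:
    """Return one launch command per node; only the node rank differs between them."""
    commands = []
    for rank in range(nnodes):
        commands.append(shlex.join([
            "python", "-m", "sglang.launch_server",
            "--model-path", model_path,
            "--tp", str(tp_size),                # assumed: total TP size across all nodes
            "--nnodes", str(nnodes),
            "--node-rank", str(rank),            # assumed flag: index of this node
            "--dist-init-addr", dist_init_addr,  # address of the master node
        ]))
    return commands

# Hypothetical setup: 2 nodes with 8 GPUs each, so a TP size of 16.
cmds = multi_node_commands("meta-llama/Llama-3.1-70B-Instruct", 16, 2, "10.0.0.1:29500")
for rank, cmd in enumerate(cmds):
    print(f"# run on node {rank}")
    print(cmd)
```

If startup hangs, appending --disable-cuda-graph to each command is the first thing to try.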
Peer-to-Peer Access
If you encounter the error “peer access is not supported between these two devices”, enable P2P checking with --enable-p2p-check.
Combining with Other Parallelism
TP + Data Parallelism
Combine TP with DP when a model already fits across the GPUs of one TP group and you need higher throughput from additional replicas.
TP + Expert Parallelism (MoE Models)
For Mixture-of-Experts models, combine TP with EP.
TP + Pipeline Parallelism
For very large models with long contexts, combine TP with PP.
Communication Backends
SGLang supports multiple communication backends for all-reduce operations:
Custom All-Reduce (Default)
Optimized all-reduce implementation for NVIDIA GPUs:
- Automatically enabled for supported architectures
- Falls back to NCCL for unsupported tensor sizes
- Disable with --disable-custom-all-reduce
PyNccl
Low-level NCCL wrapper for optimized GPU communication:
- Used for CUDA graph mode
- Supports symmetric memory allocation
Hardware-Specific Backends
AMD (ROCm):
Performance Tuning
Memory Management
Control KV cache memory allocation with --mem-fraction-static. Lower its value if you encounter OOM errors.
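To see what lowering the fraction buys, the sketch below computes the static budget and the remaining headroom on one GPU. It is a simplified model under two assumptions: the fraction applies to the GPU's total memory, and whatever is left over is available for activations and other runtime allocations. The 80 GB GPU size and the function name are illustrative.

```python
def static_budget_gb(gpu_memory_gb: float, mem_fraction_static: float) -> float:
    """GB reserved by the static allocation under the simplified model above."""
    return gpu_memory_gb * mem_fraction_static

GPU_MEMORY_GB = 80  # hypothetical GPU size

for fraction in (0.9, 0.8, 0.7):
    budget = static_budget_gb(GPU_MEMORY_GB, fraction)
    headroom = GPU_MEMORY_GB - budget
    print(f"--mem-fraction-static {fraction}: ~{budget:.0f} GB static, ~{headroom:.0f} GB headroom")
# --mem-fraction-static 0.9: ~72 GB static, ~8 GB headroom
# --mem-fraction-static 0.8: ~64 GB static, ~16 GB headroom
# --mem-fraction-static 0.7: ~56 GB static, ~24 GB headroom
```

A smaller fraction leaves more headroom and so reduces the risk of OOM during serving, at the cost of a smaller KV cache.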
Attention Backend
Select the attention implementation with the --attention-backend flag.
Deterministic All-Reduce
For reproducible results (AMD GPUs):
Troubleshooting
Deadlock During Initialization
Symptom: Server hangs during model loading with multi-node TP
Solution: Add --disable-cuda-graph when launching the server.
P2P Access Errors
Symptom: “peer access is not supported between these two devices”
Solution: Launch the server with --enable-p2p-check.
OOM Errors
Symptom: Out of memory during serving
Solution: Lower --mem-fraction-static.
Communication Overhead
Symptom: Poor throughput with multi-node TP
Solution: Consider Pipeline Parallelism for cross-node deployments.
Configuration Summary
| Parameter | Description | Default | Recommended Values |
|---|---|---|---|
| --tp | Tensor parallel size | 1 | Power of 2 (2, 4, 8) |
| --nnodes | Number of nodes | 1 | 1-4 for TP |
| --dist-init-addr | Master node address | None | <IP>:29500 |
| --mem-fraction-static | KV cache memory fraction | 0.9 | 0.7-0.9 |
| --enable-p2p-check | Check GPU P2P support | False | Enable if needed |
| --disable-cuda-graph | Disable CUDA graphs | False | Enable for debugging |
Best Practices
- Start with single-node TP before scaling to multiple nodes
- Use power-of-2 TP sizes (2, 4, 8) for optimal performance
- Monitor GPU utilization to ensure balanced workloads
- Test P2P connectivity before production deployments
- Consider alternatives for MLA models and MoE architectures
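One way to test P2P connectivity before a deployment is to query it from PyTorch directly. The sketch below is independent of SGLang and does not reproduce what --enable-p2p-check does internally. It assumes PyTorch is installed and uses torch.cuda.can_device_access_peer; on a machine without PyTorch or without CUDA GPUs it simply reports nothing to check. The function name p2p_matrix is made up for illustration.

```python
def p2p_matrix() -> dict:
    """Return {(i, j): bool} for every ordered GPU pair, or {} if no check is possible."""
    try:
        import torch  # third-party; assumed to be installed alongside SGLang
    except ImportError:
        return {}
    if not torch.cuda.is_available():
        return {}
    n = torch.cuda.device_count()
    return {
        (i, j): torch.cuda.can_device_access_peer(i, j)
        for i in range(n)
        for j in range(n)
        if i != j
    }

matrix = p2p_matrix()
if not matrix:
    print("No multi-GPU CUDA setup detected; nothing to check.")
for (i, j), ok in sorted(matrix.items()):
    print(f"GPU {i} -> GPU {j}: {'P2P ok' if ok else 'no P2P'}")
```

If any pair reports no P2P, that is a cue to launch with --enable-p2p-check.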
Related Documentation
- Data Parallelism - For higher throughput with replicas
- Expert Parallelism - For MoE models
- Pipeline Parallelism - For multi-node scaling
- Server Arguments - Complete argument reference
