Tensor Parallelism

Overview

Tensor Parallelism (TP) is a widely used parallelism strategy for LLM inference in which each layer’s weights are sharded across multiple GPUs, typically within a single node. Each GPU holds a portion of each layer’s parameters, so a model can scale beyond a single GPU’s memory capacity.

How It Works

In tensor parallelism:

Model weights are sharded across multiple GPUs
Each GPU computes a portion of each layer’s output
All-reduce operations synchronize results across GPUs
All GPUs process the same batch of requests
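The four steps above can be sketched on a single linear layer. This is a toy simulation, not SGLang's implementation: plain Python lists stand in for GPU tensors, the weight matrix is split by rows (one of several possible sharding schemes), and the helper names are made up for illustration.

```python
# Toy simulation of tensor parallelism on one linear layer, y = x @ W.
# Each "rank" is just a list entry; no real GPUs or collectives are involved.

def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def shard_rows(w, tp_size):
    """Split the weight matrix into tp_size contiguous row blocks."""
    step = len(w) // tp_size
    return [w[r * step:(r + 1) * step] for r in range(tp_size)]

def shard_cols(x, tp_size):
    """Split activations along the hidden dimension to match shard_rows."""
    step = len(x[0]) // tp_size
    return [[row[r * step:(r + 1) * step] for row in x] for r in range(tp_size)]

def all_reduce_sum(partials):
    """Element-wise sum of the per-rank partial outputs."""
    rows, cols = len(partials[0]), len(partials[0][0])
    return [[sum(p[i][j] for p in partials) for j in range(cols)]
            for i in range(rows)]

def tensor_parallel_linear(x, w, tp_size):
    """Every rank sees the same batch, computes a partial output, then all-reduce."""
    w_shards = shard_rows(w, tp_size)
    x_shards = shard_cols(x, tp_size)
    partials = [matmul(xs, ws) for xs, ws in zip(x_shards, w_shards)]
    return all_reduce_sum(partials)

x = [[1, 2, 3, 4], [5, 6, 7, 8]]        # batch of 2 tokens, hidden size 4
w = [[1, 0], [0, 1], [2, 0], [0, 2]]    # 4 x 2 weight matrix
assert tensor_parallel_linear(x, w, tp_size=2) == matmul(x, w)
```

The final assertion checks that the sharded computation reproduces the unsharded result, which is the property the all-reduce step exists to guarantee.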

Key Characteristics

Best suited for intra-node scaling (GPUs connected via NVLink/PCIe)
Requires high-bandwidth communication for all-reduce operations
Works well for models with standard attention mechanisms (GQA, MHA)
Memory efficient: Each GPU stores only a portion of model weights
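To make the memory point concrete, the sketch below estimates weight memory per GPU for a few TP sizes. It assumes the weights split evenly across ranks and ignores the KV cache, activations, and any parameters that are replicated rather than sharded, so treat the numbers as a rough lower bound. The 70-billion-parameter, 16-bit model is an illustrative assumption.

```python
def weight_gib_per_gpu(num_params, bytes_per_param, tp_size):
    """Approximate weight memory per GPU in GiB, assuming an even split."""
    return num_params * bytes_per_param / tp_size / 2**30

# Hypothetical 70-billion-parameter model stored in 16-bit precision.
for tp in (1, 2, 4, 8):
    gib = weight_gib_per_gpu(70e9, 2, tp)
    print(f"tp={tp}: ~{gib:.1f} GiB of weights per GPU")
```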

When to Use Tensor Parallelism

Use TP when:

Model doesn’t fit on a single GPU
You have multiple GPUs in a single node with fast interconnects
Working with standard attention models (Llama, Qwen, Mistral, etc.)
You need low latency for small batch sizes

Consider alternatives when:

Using MLA-based models (DeepSeek, MiniMax) → Use Data Parallelism Attention
Scaling across multiple nodes → Use Pipeline Parallelism
Working with MoE models → Combine with Expert Parallelism

Configuration

Basic Setup

Enable tensor parallelism with the --tp flag:

python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 4

This distributes the model across 4 GPUs on a single node.

Multi-Node Tensor Parallelism

To run TP across multiple nodes:

# Node 0 (Master)
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 8 \
  --nnodes 2 \
  --node-rank 0 \
  --dist-init-addr <MASTER_NODE_IP>:29500

# Node 1
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 8 \
  --nnodes 2 \
  --node-rank 1 \
  --dist-init-addr <MASTER_NODE_IP>:29500

Important: Multi-node TP requires fast interconnects (InfiniBand, RoCE). If you experience deadlocks, add --disable-cuda-graph.
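Because the two commands differ only in --node-rank, it can help to generate them from one place. The helper below is a hypothetical convenience script, not part of SGLang; it only assembles the flags shown above, and 10.0.0.1 is a placeholder address.

```python
import shlex

def launch_command(model_path, tp, nnodes, node_rank, master_addr):
    """Build the launch command for one node from the flags shown above."""
    return shlex.join([
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp", str(tp),
        "--nnodes", str(nnodes),
        "--node-rank", str(node_rank),
        "--dist-init-addr", master_addr,
    ])

# 10.0.0.1 is a placeholder; substitute the real master node IP.
for rank in range(2):
    print(launch_command("meta-llama/Meta-Llama-3.1-70B-Instruct",
                         tp=8, nnodes=2, node_rank=rank,
                         master_addr="10.0.0.1:29500"))
```

Every node must receive the same --tp, --nnodes, and --dist-init-addr values; only the rank changes.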

Peer-to-Peer Access

If you encounter the error “peer access is not supported between these two devices”, enable P2P checking:

python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 4 \
  --enable-p2p-check

Combining with Other Parallelism

TP + Data Parallelism

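Combine TP with DP when a single TP group already holds the model and you need higher throughput. Note that this example launches through sglang_router.launch_server rather than sglang.launch_server: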

python -m sglang_router.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 4 \
  --dp-size 2

This creates 2 replicas, each using 4-way TP (8 GPUs total).
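The GPU count follows from multiplying the parallel sizes. The sketch below covers only the replica-style DP case above and pipeline stages, assuming every replica and every stage holds a full TP group; the function names and the 8-GPUs-per-node default are illustrative assumptions.

```python
import math

def total_gpus(tp, dp=1, pp=1):
    """GPUs needed when each DP replica and each PP stage holds a full TP group."""
    return tp * dp * pp

def nodes_needed(tp, dp=1, pp=1, gpus_per_node=8):
    """Whole nodes required, assuming gpus_per_node identical GPUs per node."""
    return math.ceil(total_gpus(tp, dp, pp) / gpus_per_node)

print(total_gpus(tp=4, dp=2))    # 8, matching the example above
print(nodes_needed(tp=4, dp=2))  # 1 node of 8 GPUs
```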

TP + Expert Parallelism (MoE Models)

For Mixture-of-Experts models, combine TP with EP:

python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 \
  --ep 8 \
  --moe-a2a-backend deepep

See Expert Parallelism for details.

TP + Pipeline Parallelism

For very large models with long contexts:

python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3.1 \
  --tp 8 \
  --pp-size 4 \
  --chunked-prefill-size 4096

See Pipeline Parallelism for details.

Communication Backends

SGLang supports multiple communication backends for all-reduce operations:

Custom All-Reduce (Default)

Optimized all-reduce implementation for NVIDIA GPUs:

Automatically enabled for supported architectures
Falls back to NCCL for unsupported tensor sizes
Disable with --disable-custom-all-reduce

PyNccl

Low-level NCCL wrapper for optimized GPU communication:

Used for CUDA graph mode
Supports symmetric memory allocation

Hardware-Specific Backends

AMD (ROCm):

# QuickAllReduce for MI300+ GPUs
export SGLANG_USE_1STAGE_ALLREDUCE=0  # 2-stage for large tensors
python -m sglang.launch_server --model-path ... --tp 8

Intel (XPU):

python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 4 \
  --device xpu

Huawei Ascend (NPU):

export HCCL_BUFFSIZE=256  # Set HCCL buffer size (MB)
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 8

Performance Tuning

Memory Management

Control static GPU memory allocation, which covers the model weights and the KV cache pool:

python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 4 \
  --mem-fraction-static 0.85  # Fraction of GPU memory for weights + KV cache pool

Reduce --mem-fraction-static if you encounter OOM errors.
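A rough way to reason about this setting, assuming the static fraction covers both the model weights and the KV cache pool: whatever remains after the weights are loaded is available for KV cache. The function below is a back-of-the-envelope sketch with hypothetical numbers; it ignores other buffers the runtime may reserve.

```python
def kv_pool_budget_gb(gpu_mem_gb, mem_fraction_static, weights_per_gpu_gb):
    """Estimate the memory left for the KV cache pool on one GPU, in GB."""
    static_budget = gpu_mem_gb * mem_fraction_static
    return max(static_budget - weights_per_gpu_gb, 0.0)

# Hypothetical 80 GB GPU holding 35 GB of sharded weights.
for fraction in (0.7, 0.85, 0.9):
    budget = kv_pool_budget_gb(80, fraction, 35)
    print(f"mem-fraction-static={fraction}: ~{budget:.0f} GB for KV cache")
```

Lowering the fraction shrinks the KV cache pool but leaves more headroom for activations and other dynamic allocations, which is why it helps with OOM errors.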

Attention Backend

Select the optimal attention implementation:

# FlashAttention-3 (recommended for H100/H200)
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 4 \
  --attention-backend fa3

# FlashInfer (recommended for A100/A10)
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 4 \
  --attention-backend flashinfer

Deterministic All-Reduce

For reproducible results (AMD GPUs):

export SGLANG_USE_1STAGE_ALLREDUCE=1
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 8 \
  --enable-deterministic-inference

Troubleshooting

Deadlock During Initialization

Symptom: Server hangs during model loading with multi-node TP

Solution:

python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 8 \
  --nnodes 2 \
  --node-rank 0 \
  --dist-init-addr <MASTER_NODE_IP>:29500 \
  --disable-cuda-graph

P2P Access Errors

Symptom: “peer access is not supported between these two devices”

Solution:

python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 4 \
  --enable-p2p-check

OOM Errors

Symptom: Out of memory during serving

Solution:

python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 4 \
  --mem-fraction-static 0.7  # Reduce KV cache size

For long prompts, enable chunked prefill:

python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 4 \
  --chunked-prefill-size 4096

Communication Overhead

Symptom: Poor throughput with multi-node TP

Solution: Consider Pipeline Parallelism for cross-node deployments:

python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 4 \
  --pp-size 2 \
  --chunked-prefill-size 4096
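To see why cross-node TP can become communication-bound, the sketch below estimates all-reduce traffic per forward pass. It assumes a ring all-reduce, where each GPU sends 2(N-1)/N times the message size, and two all-reduces per transformer layer; both are common conventions but are assumptions here, not a description of SGLang's kernels. The model dimensions are illustrative.

```python
def allreduce_bytes_per_gpu(tokens, hidden_size, bytes_per_elem, tp_size):
    """Bytes each GPU sends in one ring all-reduce of a (tokens x hidden) tensor."""
    message = tokens * hidden_size * bytes_per_elem
    return 2 * (tp_size - 1) / tp_size * message

def traffic_per_forward_pass(tokens, hidden_size, bytes_per_elem, tp_size,
                             num_layers, allreduces_per_layer=2):
    """Total bytes each GPU sends across all layers of one forward pass."""
    per_op = allreduce_bytes_per_gpu(tokens, hidden_size, bytes_per_elem, tp_size)
    return per_op * num_layers * allreduces_per_layer

# Illustrative values: 4096-token prefill chunk, hidden size 8192, 16-bit, 80 layers.
for tp in (2, 4, 8):
    gib = traffic_per_forward_pass(4096, 8192, 2, tp, num_layers=80) / 2**30
    print(f"tp={tp}: ~{gib:.1f} GiB sent per GPU per forward pass")
```

Dividing the result by the available link bandwidth gives a lower bound on time spent in communication, which is a quick way to compare an intra-node interconnect against a cross-node one.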

Configuration Summary

Parameter	Description	Default	Recommended Values
`--tp`	Tensor parallel size	`1`	Power of 2 (2, 4, 8)
`--nnodes`	Number of nodes	`1`	1-4 for TP
`--dist-init-addr`	Master node address	`None`	`<IP>:29500`
`--mem-fraction-static`	Fraction of GPU memory for static allocation (weights + KV cache pool)	`0.9`	0.7-0.9
`--enable-p2p-check`	Check GPU P2P support	`False`	Enable if needed
`--disable-cuda-graph`	Disable CUDA graphs	`False`	Enable for debugging

Best Practices

Start with single-node TP before scaling to multiple nodes
Use power-of-2 TP sizes (2, 4, 8) for optimal performance
Monitor GPU utilization to ensure balanced workloads
Test P2P connectivity before production deployments
Consider alternatives for MLA models and MoE architectures

Related Documentation

Data Parallelism - For higher throughput with replicas
Expert Parallelism - For MoE models
Pipeline Parallelism - For multi-node scaling
Server Arguments - Complete argument reference
