Quick Start - SGLang

This guide will help you launch an SGLang server and send your first requests using both the OpenAI-compatible API and the native SGLang API.

Prerequisites

Install SGLang

First, install SGLang. The commands below use pip to install uv, then use uv to install SGLang:

pip install --upgrade pip
pip install uv
uv pip install sglang

See the Installation Guide for other installation methods.

Verify Installation

python -c "import sglang; print(sglang.__version__)"
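The installed version can also be read from package metadata without importing the package, which helps when the import itself fails. A minimal sketch using only the standard library; the helper name `installed_version` is illustrative and not part of SGLang:

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Reports the version if SGLang is installed in the current environment.
print("sglang:", installed_version("sglang") or "not installed")
```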

Launch Your First Server

Start the SGLang server

Launch a server with a small model for testing:

python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 127.0.0.1 \
  --port 30000

The server will download the model from Hugging Face on first launch. Set the HF_TOKEN environment variable if you need to access gated models:

export HF_TOKEN=your_huggingface_token

Common Server Launch Options

For example, to shard the model across two GPUs with tensor parallelism, add --tp 2:

python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --tp 2
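The same command line can be assembled from Python, for instance when a script needs to start the server itself. A sketch that assumes only the flags shown above (`--model-path`, `--host`, `--port`, `--tp`); the helper name `build_launch_command` is hypothetical:

```python
import sys
from typing import List, Optional


def build_launch_command(
    model_path: str,
    host: str = "127.0.0.1",
    port: int = 30000,
    tp: Optional[int] = None,
) -> List[str]:
    """Assemble the launch_server command line as an argument list."""
    cmd = [
        sys.executable, "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--host", host,
        "--port", str(port),
    ]
    if tp is not None:
        # Only add tensor parallelism when explicitly requested.
        cmd += ["--tp", str(tp)]
    return cmd


cmd = build_launch_command("meta-llama/Llama-3.1-8B-Instruct", tp=2)
print(" ".join(cmd))
# To actually start the server, pass the list to subprocess.Popen(cmd).
```

Building a list rather than a shell string avoids quoting problems when the model path contains spaces.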

Wait for the server to be ready

Look for the following messages in the logs:

INFO:     Started server process
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:30000
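Instead of watching the logs, a script can poll the server until it responds. The sketch below assumes the `/health` endpoint listed under API Endpoints returns HTTP 200 once the server is ready; the function name `wait_for_server` and its defaults are illustrative:

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url: str, timeout: float = 300.0, interval: float = 2.0) -> bool:
    """Poll GET {base_url}/health until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not accepting connections yet, or an error status was returned
        time.sleep(interval)
    return False


# Example: wait_for_server("http://127.0.0.1:30000") returns True once the server is up.
```

A generous timeout is useful on first launch, because the model download can take several minutes.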

Send Your First Request

Using OpenAI-Compatible API

SGLang provides OpenAI-compatible endpoints, making it easy to integrate with existing applications.

from openai import OpenAI

# Create an OpenAI client pointing to SGLang server
client = OpenAI(
    base_url="http://127.0.0.1:30000/v1",
    api_key="EMPTY"  # SGLang doesn't require authentication by default
)

# Chat completion
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    temperature=0.7,
    max_tokens=256
)

print(response.choices[0].message.content)

Expected Output (exact wording will vary because sampling is enabled):

The capital of France is Paris. It is one of the most famous and beautiful cities in the world, known for its iconic landmarks like the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral.
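Because the endpoint follows the OpenAI wire format, the same request can be sent without the `openai` package. The sketch below builds the equivalent HTTP request with the standard library; the helper name `build_chat_request` is hypothetical, and the response layout in the comments is assumed to mirror the fields used by the client above:

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, messages: list, **params) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    body = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request(
    "http://127.0.0.1:30000",
    "meta-llama/Llama-3.1-8B-Instruct",
    [{"role": "user", "content": "What is the capital of France?"}],
    temperature=0.7,
    max_tokens=256,
)
print(req.full_url)
# With the server running, send the request and read the reply:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```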

Using Native SGLang API

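The native SGLang API runs the engine directly inside your Python process, so no separate server is needed. The example below creates an offline engine and generates completions for a batch of prompts.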

import dataclasses

import sglang as sgl
from sglang.srt.server_args import ServerArgs


def main():
    # Create an offline engine
    server_args = ServerArgs(
        model_path="meta-llama/Llama-3.1-8B-Instruct"
    )
    llm = sgl.Engine(**dataclasses.asdict(server_args))

    # Generate responses
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]

    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    outputs = llm.generate(prompts, sampling_params)

    # Print outputs
    for prompt, output in zip(prompts, outputs):
        print(f"Prompt: {prompt}")
        print(f"Generated: {output['text']}")
        print("=" * 50)


# The engine starts worker subprocesses, so keep the entry point under a main guard.
if __name__ == "__main__":
    main()

Expected Output (first two prompts shown; generated text will vary):

Prompt: Hello, my name is
Generated: John, and I'm excited to share my story with you today. I grew up in a small town in the Midwest
==================================================
Prompt: The president of the United States is
Generated: the head of state and head of government of the United States of America. The president directs the executive branch
==================================================

Complete Working Example

Here’s a full end-to-end example you can run:

Create a Python script

Save this as quickstart.py:

from openai import OpenAI

# Initialize client
client = OpenAI(
    base_url="http://127.0.0.1:30000/v1",
    api_key="EMPTY"
)

# Example 1: Simple chat completion
print("=" * 50)
print("Example 1: Simple Chat")
print("=" * 50)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Explain quantum computing in one sentence."}
    ],
    max_tokens=100
)
print(response.choices[0].message.content)

# Example 2: Multi-turn conversation
print("\n" + "=" * 50)
print("Example 2: Multi-turn Conversation")
print("=" * 50)
messages = [
    {"role": "system", "content": "You are a helpful math tutor."},
    {"role": "user", "content": "What is 15 * 23?"}
]
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=messages,
    max_tokens=50
)
assistant_reply = response.choices[0].message.content
print(f"Assistant: {assistant_reply}")

# Continue the conversation
messages.append({"role": "assistant", "content": assistant_reply})
messages.append({"role": "user", "content": "Now multiply that by 2."})
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=messages,
    max_tokens=50
)
print(f"Assistant: {response.choices[0].message.content}")

# Example 3: Streaming response
print("\n" + "=" * 50)
print("Example 3: Streaming")
print("=" * 50)
print("Assistant: ", end="")
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Count from 1 to 5."}],
    max_tokens=50,
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print("\n")

# Example 4: Batch requests
print("=" * 50)
print("Example 4: Batch Processing")
print("=" * 50)
questions = [
    "What is the capital of Japan?",
    "What is the capital of Germany?",
    "What is the capital of Brazil?"
]

for question in questions:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": question}],
        max_tokens=50,
        temperature=0.0  # Deterministic output
    )
    print(f"Q: {question}")
    print(f"A: {response.choices[0].message.content}")
    print()

Ensure the server is running

python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 127.0.0.1 \
  --port 30000

Run the example

python quickstart.py

API Endpoints

SGLang provides several API endpoints:

Endpoint	Description	OpenAI Compatible
`/v1/chat/completions`	Chat completions	✅
`/v1/completions`	Text completions	✅
`/v1/embeddings`	Generate embeddings	✅
`/generate`	Native SGLang generation	❌
`/get_model_info`	Get model metadata	❌
`/health`	Health check	❌
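The native `/generate` endpoint takes a different request body from the OpenAI-compatible routes. The sketch below assumes the body has the shape `{"text": ..., "sampling_params": {...}}` and that the token limit is named `max_new_tokens`; check both against the API reference for your SGLang version. The helper name `build_generate_request` is hypothetical:

```python
import json
import urllib.request


def build_generate_request(base_url: str, prompt: str, **sampling_params) -> urllib.request.Request:
    """Build a POST request for the native /generate endpoint.

    Assumed body shape: {"text": <prompt>, "sampling_params": {...}}.
    """
    body = {"text": prompt, "sampling_params": sampling_params}
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


gen_req = build_generate_request(
    "http://127.0.0.1:30000",
    "The capital of France is",
    temperature=0.0,
    max_new_tokens=32,  # assumed parameter name for the native API
)
print(gen_req.full_url)
# With the server running:
# with urllib.request.urlopen(gen_req) as resp:
#     print(json.load(resp))
```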

Sampling Parameters

Control generation behavior with these common parameters:

Common Sampling Parameters

Parameter	Type	Default	Description
`temperature`	float	1.0	Controls randomness (0.0 = deterministic, higher = more creative)
`top_p`	float	1.0	Nucleus sampling threshold
`max_tokens`	int	128	Maximum tokens to generate
`frequency_penalty`	float	0.0	Penalize tokens in proportion to how often they have already appeared (-2.0 to 2.0)
`presence_penalty`	float	0.0	Penalize tokens that have already appeared at least once (-2.0 to 2.0)
`stop`	str/list	None	Stop sequences
`n`	int	1	Number of completions to generate
`stream`	bool	false	Enable streaming responses

See Sampling Parameters for the complete list.
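A client can catch out-of-range values before sending a request. The sketch below takes the penalty ranges and defaults from the table above; the bounds for `temperature`, `top_p`, `n`, and `max_tokens` are conventional assumptions rather than documented limits, and `check_sampling_params` is an illustrative name:

```python
def check_sampling_params(params: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if params.get("temperature", 1.0) < 0:
        problems.append("temperature must be >= 0")
    if not 0 < params.get("top_p", 1.0) <= 1:
        problems.append("top_p must be in (0, 1]")
    for name in ("frequency_penalty", "presence_penalty"):
        # Range taken from the table above.
        if not -2.0 <= params.get(name, 0.0) <= 2.0:
            problems.append(f"{name} must be in [-2.0, 2.0]")
    if params.get("n", 1) < 1:
        problems.append("n must be >= 1")
    if params.get("max_tokens", 128) < 1:
        problems.append("max_tokens must be >= 1")
    return problems


print(check_sampling_params({"temperature": 0.7, "top_p": 0.95}))
print(check_sampling_params({"temperature": -1, "presence_penalty": 3}))
```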

Common Use Cases

Text Completion

response = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="Once upon a time",
    max_tokens=100,
    temperature=0.8
)
print(response.choices[0].text)

JSON Mode / Structured Output

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 colors in JSON format"}
    ],
    response_format={"type": "json_object"},
    max_tokens=100
)
print(response.choices[0].message.content)
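The reply arrives as a string, so it still has to be parsed, and a reply cut off by `max_tokens` may not be valid JSON. A defensive sketch; `parse_json_reply` is an illustrative helper, not part of SGLang or the OpenAI client:

```python
import json
from typing import Any, Optional


def parse_json_reply(content: Optional[str]) -> Optional[Any]:
    """Parse a model reply as JSON; return None if it is empty or not valid JSON."""
    if not content:
        return None
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        return None


print(parse_json_reply('{"colors": ["red", "green", "blue"]}'))
print(parse_json_reply('{"colors": ["red", "gre'))  # truncated reply, prints None
```

With a live server, call `parse_json_reply(response.choices[0].message.content)` and retry with a larger `max_tokens` when it returns None.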

Batch Inference

import dataclasses

import sglang as sgl
from sglang.srt.server_args import ServerArgs


def main():
    server_args = ServerArgs(model_path="meta-llama/Llama-3.1-8B-Instruct")
    llm = sgl.Engine(**dataclasses.asdict(server_args))

    # Submit all prompts in one call so the engine can batch them
    prompts = [f"Question {i}: What is 2+{i}?" for i in range(10)]
    outputs = llm.generate(prompts, {"temperature": 0.0})

    for prompt, output in zip(prompts, outputs):
        print(f"{prompt} -> {output['text']}")


# The engine starts worker subprocesses, so keep the entry point under a main guard.
if __name__ == "__main__":
    main()

Troubleshooting

Server won't start

Out of memory error: lower the fraction of GPU memory that SGLang reserves for model weights and the KV cache:

python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --mem-fraction-static 0.7

CUDA errors:

Verify GPU availability: nvidia-smi
Check CUDA version: nvcc --version
Ensure PyTorch can see GPUs: python -c "import torch; print(torch.cuda.is_available())"

Slow inference

Use tensor parallelism for multi-GPU: --tp 2
Enable FP8 quantization: --quantization fp8
Reduce context length: --context-length 4096
See Performance Tuning for optimization

Connection refused

Ensure server is running: check for “Uvicorn running” message
Verify the server is listening on the expected port: lsof -i :30000
Check firewall settings
Use correct host/port in client
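The port check above can also be done from Python, which is handy on systems without lsof. A minimal sketch using only the standard library; `is_port_open` is an illustrative name:

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


print(is_port_open("127.0.0.1", 30000))
```

A result of False while the server process is running usually means the client is using a different host or port than the one passed to launch_server.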

Next Steps

Server Arguments

Learn about all available server configuration options

Sampling Parameters

Control generation behavior with sampling parameters

Model Support

Browse supported models and architectures

Production Deployment

Deploy SGLang in production with monitoring
