Overview
Sampling parameters control how the model generates text. They affect randomness, diversity, length, and structure of the output.
Quick Reference
sampling_params = {
"max_new_tokens": 256,
"temperature": 0.8,
"top_p": 0.95,
"top_k": 50,
"frequency_penalty": 0.0,
"presence_penalty": 0.0,
"stop": ["\n\n", "END"]
}
Token Generation
max_new_tokens
Maximum number of tokens to generate.
# Short response
engine.generate(
prompt="What is AI?",
sampling_params={"max_new_tokens": 50}
)
# Long response
engine.generate(
prompt="Write a detailed essay",
sampling_params={"max_new_tokens": 2048}
)
min_new_tokens
Minimum number of tokens to generate before allowing stop sequences or EOS.
# Ensure at least 100 tokens are generated
engine.generate(
prompt="Explain quantum physics",
sampling_params={
"min_new_tokens": 100,
"max_new_tokens": 500
}
)
ignore_eos
Continue generation even after EOS token is generated.
# Ignore end-of-sequence token
engine.generate(
prompt="Count to 100",
sampling_params={
"max_new_tokens": 1000,
"ignore_eos": True
}
)
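The interaction between max_new_tokens, min_new_tokens, and ignore_eos can be summarized as a per-step stop check. The sketch below is a plain-Python illustration of the rules described above, not SGLang's scheduler code; the function name `should_stop` and its arguments are hypothetical.

```python
def should_stop(num_generated, is_eos, max_new_tokens,
                min_new_tokens=0, ignore_eos=False):
    """Decide whether generation ends after the latest token.

    num_generated: tokens produced so far, including the latest one.
    is_eos: whether the latest token is the end-of-sequence token.
    """
    # The length cap always wins.
    if num_generated >= max_new_tokens:
        return True
    # Below the minimum, EOS is not allowed to end generation.
    if num_generated < min_new_tokens:
        return False
    # Otherwise EOS ends generation unless it is being ignored.
    return is_eos and not ignore_eos


print(should_stop(10, True, max_new_tokens=500, min_new_tokens=100))
print(should_stop(120, True, max_new_tokens=500, min_new_tokens=100))
print(should_stop(120, True, max_new_tokens=1000, ignore_eos=True))
```

Stop strings and stop token IDs are left out of this sketch to keep the length rules visible on their own.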
Randomness Control
temperature
Controls randomness. Lower values (0.0-0.5) make output more focused and deterministic.
Higher values (0.8-2.0) make output more creative and diverse.
Setting to 0.0 enables greedy decoding (always pick the most likely token).
# Factual, deterministic output
engine.generate(
prompt="What is the capital of France?",
sampling_params={"temperature": 0.0}
)
# Creative, varied output
engine.generate(
prompt="Write a creative story",
sampling_params={"temperature": 0.9}
)
# Very random (can be incoherent)
engine.generate(
prompt="Generate random text",
sampling_params={"temperature": 1.5}
)
Best Practices:
- 0.0: Math, factual QA, code generation
- 0.3-0.5: General assistant, summaries
- 0.7-0.9: Creative writing, brainstorming
- 1.0+: Experimental, high diversity needed
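To make the effect of temperature concrete, the sketch below applies temperature scaling to a toy logit vector in plain Python. It illustrates the general technique rather than SGLang's internal sampler; the function name `softmax_with_temperature` and the logit values are made up for the example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return token probabilities after dividing logits by temperature.

    A temperature of 0.0 is treated as greedy decoding: all probability
    mass goes to the highest-scoring token.
    """
    if temperature < 0.0:
        raise ValueError("temperature must be >= 0.0")
    if temperature == 0.0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.1]  # hypothetical scores for three tokens
for t in (0.0, 0.5, 1.0, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature rises, the printed distribution flattens, which is why high values produce more varied (and eventually incoherent) text.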
top_p (Nucleus Sampling)
Cumulative probability threshold for nucleus sampling. Only the smallest set of most likely tokens whose cumulative probability reaches top_p is considered. Range: (0.0, 1.0]
Lower values (0.1-0.5) produce more focused output.
Higher values (0.9-1.0) allow more diversity.
# Very focused
engine.generate(
prompt="Summarize this article",
sampling_params={"top_p": 0.1, "temperature": 0.7}
)
# Balanced
engine.generate(
prompt="Write a paragraph",
sampling_params={"top_p": 0.9, "temperature": 0.8}
)
# Maximum diversity
engine.generate(
prompt="Brainstorm ideas",
sampling_params={"top_p": 1.0, "temperature": 1.0}
)
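The nucleus filter can be sketched in a few lines of plain Python. This is a simplified illustration of the usual top-p technique, not SGLang's implementation; the helper name `top_p_filter` and the toy distribution are assumptions, and real samplers may treat the boundary token slightly differently.

```python
def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalize. Returns {token_index: probability}."""
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0.0, 1.0]")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


toy_probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token distribution
print(top_p_filter(toy_probs, 0.1))  # very focused: a single candidate
print(top_p_filter(toy_probs, 0.9))  # balanced: the tail token is dropped
print(top_p_filter(toy_probs, 1.0))  # every token stays eligible
```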
top_k
Only sample from the top K most likely tokens. Set to -1 to disable (consider all tokens).
# Very constrained
engine.generate(
prompt="Complete: The capital of France is",
sampling_params={"top_k": 5, "temperature": 0.7}
)
# More options
engine.generate(
prompt="Write creatively",
sampling_params={"top_k": 50, "temperature": 0.8}
)
# Unconstrained (default)
engine.generate(
prompt="Generate text",
sampling_params={"top_k": -1, "temperature": 0.8}
)
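As a rough illustration of what top_k does to a distribution, the following sketch keeps the K most likely tokens and renormalizes. The helper `top_k_filter` and the toy probabilities are hypothetical and only mirror the behavior described above, including -1 as "disabled".

```python
def top_k_filter(probs, top_k):
    """Keep the top_k most likely tokens and renormalize.

    top_k == -1 disables the filter and returns every token.
    Returns {token_index: probability}.
    """
    if top_k == -1:
        return dict(enumerate(probs))
    if top_k < 1:
        raise ValueError("top_k must be -1 or at least 1")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = order[:top_k]
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


toy_probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token distribution
print(top_k_filter(toy_probs, 2))   # only the two most likely tokens remain
print(top_k_filter(toy_probs, -1))  # unconstrained
```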
min_p
Minimum probability threshold, relative to the most likely token. Tokens whose probability is below min_p times the top token's probability are filtered out.
Range: [0.0, 1.0]
# Filter out low-probability tokens
engine.generate(
    prompt="Write text",
    sampling_params={
        "min_p": 0.05,  # Drop tokens below 5% of the top token's probability
        "temperature": 0.8
    }
)
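A minimal sketch of min-p filtering follows, assuming the common definition in which the cutoff is scaled by the probability of the most likely token. The helper `min_p_filter` and the toy distribution are hypothetical and not taken from SGLang's source.

```python
def min_p_filter(probs, min_p):
    """Drop tokens whose probability is below min_p * max(probs),
    then renormalize. Returns {token_index: probability}."""
    if not 0.0 <= min_p <= 1.0:
        raise ValueError("min_p must be in [0.0, 1.0]")
    threshold = min_p * max(probs)
    kept = {i: p for i, p in enumerate(probs) if p >= threshold}
    total = sum(kept.values())
    return {i: p / total for i, p in kept.items()}


toy_probs = [0.6, 0.3, 0.08, 0.02]  # hypothetical next-token distribution
# Cutoff is 0.05 * 0.6 = 0.03, so only the last token is removed.
print(min_p_filter(toy_probs, 0.05))
```

Because the cutoff scales with the top token, the filter is strict when the model is confident and permissive when the distribution is flat.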
Repetition Control
frequency_penalty
Penalize tokens based on their frequency in the generated text. Higher values reduce repetition.
Range: [-2.0, 2.0]
Positive values: Discourage repetition
Negative values: Encourage repetition
# Reduce repetition
engine.generate(
prompt="Write a diverse essay",
sampling_params={
"frequency_penalty": 0.5,
"max_new_tokens": 500
}
)
# Strong anti-repetition
engine.generate(
prompt="List unique ideas",
sampling_params={
"frequency_penalty": 1.0,
"max_new_tokens": 200
}
)
presence_penalty
Penalize tokens that have already appeared (regardless of frequency). Range: [-2.0, 2.0]
Positive values: Encourage new topics
Negative values: Stay on topic
# Encourage topic diversity
engine.generate(
prompt="Brainstorm topics",
sampling_params={
"presence_penalty": 0.6,
"max_new_tokens": 300
}
)
repetition_penalty
Apply a penalty to tokens that have been generated. Range: [0.0, 2.0]
Values > 1.0: Discourage repetition
Value = 1.0: No penalty (default)
Values < 1.0: Encourage repetition
# Reduce repetition (alternative to frequency_penalty)
engine.generate(
prompt="Write text",
sampling_params={
"repetition_penalty": 1.2,
"max_new_tokens": 200
}
)
Penalty Comparison:
frequency_penalty: Linear scaling based on token frequency
presence_penalty: Binary (appeared or not)
repetition_penalty: Multiplicative penalty on logits
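The three penalties in the comparison above can be contrasted on a toy logit vector. This sketch follows the conventional formulas (subtract count times frequency_penalty, subtract presence_penalty once, divide or multiply by repetition_penalty); it is an assumption about the general technique rather than a copy of SGLang's code, and `apply_penalties` is a hypothetical helper.

```python
from collections import Counter


def apply_penalties(logits, generated_ids, frequency_penalty=0.0,
                    presence_penalty=0.0, repetition_penalty=1.0):
    """Return a penalized copy of logits for tokens in generated_ids."""
    if repetition_penalty <= 0.0:
        raise ValueError("this sketch requires repetition_penalty > 0")
    out = list(logits)
    for token_id, count in Counter(generated_ids).items():
        # Multiplicative: shrink positive logits, push negative ones lower.
        if out[token_id] > 0:
            out[token_id] /= repetition_penalty
        else:
            out[token_id] *= repetition_penalty
        # Linear: grows with how often the token appeared.
        out[token_id] -= frequency_penalty * count
        # Binary: same deduction whether it appeared once or many times.
        out[token_id] -= presence_penalty
    return out


toy_logits = [2.0, 1.0, -1.0, 0.5]  # hypothetical scores for four tokens
history = [0, 0, 2]                 # token 0 seen twice, token 2 once
print(apply_penalties(toy_logits, history, frequency_penalty=0.5))
print(apply_penalties(toy_logits, history, presence_penalty=0.6))
print(apply_penalties(toy_logits, history, repetition_penalty=2.0))
```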
Stop Conditions
stop
Type: string | array. Default: null
Stop generation when any of these strings are generated.
# Single stop string
engine.generate(
prompt="List items:\n1.",
sampling_params={
"stop": "\n\n",
"max_new_tokens": 200
}
)
# Multiple stop strings
engine.generate(
prompt="Write code",
sampling_params={
"stop": ["```", "\n\nEND", "<|endoftext|>"],
"max_new_tokens": 500
}
)
stop_token_ids
Stop generation when any of these token IDs are generated.
# Stop on specific token IDs
engine.generate(
prompt="Generate text",
sampling_params={
"stop_token_ids": [128001, 128009], # Model-specific IDs
"max_new_tokens": 200
}
)
stop_regex
Type: string | array. Default: null
Stop generation when output matches any of these regex patterns.
# Stop when a number pattern appears
engine.generate(
prompt="Count:",
sampling_params={
"stop_regex": r"\d{3}", # Stop at 3-digit number
"max_new_tokens": 100
}
)
# Multiple regex patterns
engine.generate(
prompt="Write text",
sampling_params={
"stop_regex": [r"\bEND\b", r"\d{4}-\d{2}-\d{2}"],
"max_new_tokens": 300
}
)
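Before passing patterns to stop_regex, it can help to check offline where they would first match in sample text. The sketch below uses Python's standard `re` module for that; `first_stop_match` is a hypothetical helper, and the sketch makes no claim about how SGLang itself scans or trims the output.

```python
import re


def first_stop_match(text, stop_regex):
    """Return the earliest match of any pattern in text, or None.

    stop_regex may be a single pattern string or a list of patterns.
    """
    patterns = [stop_regex] if isinstance(stop_regex, str) else list(stop_regex)
    best = None
    for pattern in patterns:
        match = re.search(pattern, text)
        if match and (best is None or match.start() < best.start()):
            best = match
    return best


match = first_stop_match("98 99 100 101", r"\d{3}")
print(match.group(), match.start())

match = first_stop_match("Due 2024-03-15. END",
                         [r"\bEND\b", r"\d{4}-\d{2}-\d{2}"])
print(match.group())
```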
no_stop_trim
If true, don’t remove the stop string from the output.
# Include stop string in output
engine.generate(
prompt="Count to 3:",
sampling_params={
"stop": "\n\n",
"no_stop_trim": True,
"max_new_tokens": 50
}
)
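The difference between the default trimming and no_stop_trim can be shown on a plain string. This is a simplified model of the behavior described above, with a hypothetical helper `apply_stop`; it works on finished text, whereas a real engine checks stop strings incrementally during decoding.

```python
def apply_stop(text, stop, no_stop_trim=False):
    """Cut text at the earliest stop string.

    By default the stop string is removed; with no_stop_trim=True it is kept.
    stop may be a single string or a list of strings.
    """
    stops = [stop] if isinstance(stop, str) else list(stop)
    earliest = None
    for candidate in stops:
        index = text.find(candidate)
        if index != -1 and (earliest is None or index < earliest[0]):
            earliest = (index, candidate)
    if earliest is None:
        return text
    index, matched = earliest
    end = index + len(matched) if no_stop_trim else index
    return text[:end]


raw = "1\n2\n3\n\nextra text"
print(repr(apply_stop(raw, "\n\n")))                     # stop string removed
print(repr(apply_stop(raw, "\n\n", no_stop_trim=True)))  # stop string kept
```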
Structured Output
json_schema
JSON schema to constrain output. Ensures generated text is valid JSON matching the schema.
import json
schema = {
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "integer", "minimum": 0},
"email": {"type": "string", "format": "email"},
"hobbies": {
"type": "array",
"items": {"type": "string"}
}
},
"required": ["name", "age"]
}
response = engine.generate(
prompt="Generate a person profile",
sampling_params={
"json_schema": json.dumps(schema),
"max_new_tokens": 200
}
)
data = json.loads(response["text"])
print(f"Name: {data['name']}, Age: {data['age']}")
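Even with a schema constraint, the returned text may be incomplete, for example if generation hits max_new_tokens before the JSON object is closed. A defensive parse along the following lines avoids an unhandled exception; `parse_profile` is a hypothetical helper and the required keys mirror the schema above.

```python
import json


def parse_profile(text, required=("name", "age")):
    """Parse model output as JSON and check the required keys.

    Returns the parsed dict, or None if the text is not valid JSON,
    is not an object, or lacks a required key.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if any(key not in data for key in required):
        return None
    return data


print(parse_profile('{"name": "Ada", "age": 36}'))
print(parse_profile('{"name": "Ada", "ag'))  # truncated output
```

If truncation shows up in practice, raising max_new_tokens is usually the first thing to try.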
regex
Regular expression pattern to constrain output format.
# Phone number format
engine.generate(
prompt="Generate a phone number:",
sampling_params={
"regex": r"\d{3}-\d{3}-\d{4}",
"max_new_tokens": 20
}
)
# Output: "555-123-4567"
# Email format
engine.generate(
prompt="Generate an email:",
sampling_params={
"regex": r"[a-z]+@[a-z]+\.[a-z]+",
"max_new_tokens": 30
}
)
# Output: "user@example.com"
# Date format
engine.generate(
prompt="Today's date:",
sampling_params={
"regex": r"\d{4}-\d{2}-\d{2}",
"max_new_tokens": 15
}
)
# Output: "2024-03-15"
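A pattern can be checked against sample strings with Python's `re` module before it is used as a constraint. The sketch below tests the three patterns from the examples above; `PATTERNS` and `matches_format` are names introduced only for this illustration, and Python's regex dialect may differ in places from the one used by the constrained-decoding backend.

```python
import re

# Patterns copied from the examples above.
PATTERNS = {
    "phone": r"\d{3}-\d{3}-\d{4}",
    "email": r"[a-z]+@[a-z]+\.[a-z]+",
    "date": r"\d{4}-\d{2}-\d{2}",
}


def matches_format(kind, text):
    """Return True if the whole text matches the named pattern."""
    return re.fullmatch(PATTERNS[kind], text) is not None


print(matches_format("phone", "555-123-4567"))
print(matches_format("email", "user@example.com"))
print(matches_format("date", "2024-3-15"))  # month is not zero-padded
```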
ebnf
EBNF (Extended Backus-Naur Form) grammar to constrain output.
# Mathematical expression grammar
grammar = """
root ::= expression
expression ::= term ((" + " | " - ") term)*
term ::= factor ((" * " | " / ") factor)*
factor ::= number | "(" expression ")"
number ::= [0-9]+
"""
engine.generate(
prompt="Generate a math expression:",
sampling_params={
"ebnf": grammar,
"max_new_tokens": 50
}
)
# Output: "(5 + 3) * 2 - 7"
# SQL query grammar
sql_grammar = """
root ::= select_stmt
select_stmt ::= "SELECT " column_list " FROM " table_name where_clause?
column_list ::= "*" | column_name (", " column_name)*
where_clause ::= " WHERE " condition
condition ::= column_name " = " value
column_name ::= [a-zA-Z_]+
table_name ::= [a-zA-Z_]+
value ::= [0-9]+ | "'" [a-zA-Z ]+ "'"
"""
engine.generate(
prompt="Generate a SQL query:",
sampling_params={
"ebnf": sql_grammar,
"max_new_tokens": 100
}
)
# Output: "SELECT * FROM users WHERE id = 5"
Advanced Parameters
n (Number of Completions)
Generate N independent completions for each prompt.
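# Generate 3 different responses
response = engine.generate(
    prompt="Write a creative opening sentence",
    sampling_params={
        "n": 3,
        "temperature": 0.9,
        "max_new_tokens": 50
    }
)
# With n > 1 the result is expected to be a list with one entry per completion;
# wrap a single result so the loop works either way.
outputs = response if isinstance(response, list) else [response]
for i, output in enumerate(outputs):
    print(f"Option {i+1}: {output['text']}")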
logit_bias
Modify the likelihood of specific tokens. Keys are token IDs, values are bias adjustments.
Range: Typically [-100, 100]
# Discourage specific tokens
tokenizer = engine.tokenizer_manager.tokenizer
token_id = tokenizer.encode("bad", add_special_tokens=False)[0]  # skip BOS so the first ID belongs to the word
engine.generate(
prompt="Write a review",
sampling_params={
"logit_bias": {str(token_id): -10.0},
"max_new_tokens": 100
}
)
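The effect of a bias value can be pictured as a simple addition to the raw logits before sampling. The sketch below assumes that additive convention and string keys as in the example above; `apply_logit_bias` and the toy logits are hypothetical.

```python
def apply_logit_bias(logits, logit_bias):
    """Return a copy of logits with each bias added to its token's score.

    logit_bias maps token IDs (as strings or ints) to bias values.
    """
    out = list(logits)
    for token_id, bias in logit_bias.items():
        out[int(token_id)] += bias
    return out


toy_logits = [1.5, 0.2, -0.3]  # hypothetical scores for token IDs 0, 1, 2
print(apply_logit_bias(toy_logits, {"0": -10.0}))  # token 0 becomes very unlikely
print(apply_logit_bias(toy_logits, {"2": 5.0}))    # token 2 is strongly favored
```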
sampling_seed
Random seed for reproducible sampling. Set this for deterministic outputs.
# Reproducible generation
for i in range(3):
    response = engine.generate(
        prompt="Generate a random number",
        sampling_params={
            "sampling_seed": 42,
            "temperature": 1.0,
            "max_new_tokens": 10
        }
    )
    print(response["text"])  # Same output each time
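The idea behind seeded sampling can be demonstrated without a model: a random generator created from the same seed draws the same sequence. In this sketch Python's `random.Random` stands in for the sampler's random source, and `sample_tokens`, the vocabulary, and the weights are made up for the illustration.

```python
import random


def sample_tokens(vocab, weights, seed, count):
    """Draw count tokens from vocab with the given weights and seed."""
    rng = random.Random(seed)  # private generator, independent of global state
    return rng.choices(vocab, weights=weights, k=count)


vocab = ["red", "green", "blue"]
weights = [0.5, 0.3, 0.2]

first = sample_tokens(vocab, weights, seed=42, count=5)
second = sample_tokens(vocab, weights, seed=42, count=5)
print(first)
print(first == second)  # True: same seed, same draws
```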
skip_special_tokens
Remove special tokens (BOS, EOS, PAD) from decoded output.
# Include special tokens in output
engine.generate(
prompt="Hello",
sampling_params={
"skip_special_tokens": False,
"max_new_tokens": 20
}
)
spaces_between_special_tokens
Add spaces between special tokens when decoding.
Parameter Combinations
Creative Writing
sampling_params = {
"max_new_tokens": 500,
"temperature": 0.9,
"top_p": 0.95,
"frequency_penalty": 0.3,
"presence_penalty": 0.3
}
Code Generation
sampling_params = {
"max_new_tokens": 512,
"temperature": 0.2,
"top_p": 0.95,
"stop": ["\n\n", "```"],
"repetition_penalty": 1.1
}
Factual Q&A
sampling_params = {
"max_new_tokens": 150,
"temperature": 0.0, # Greedy
"top_k": 1
}
JSON Generation
sampling_params = {
"max_new_tokens": 300,
"temperature": 0.3,
"json_schema": json.dumps(schema)
}
Diverse Brainstorming
sampling_params = {
"max_new_tokens": 200,
"temperature": 1.2,
"top_p": 0.98,
"presence_penalty": 0.8,
"n": 5 # Generate 5 ideas
}
Parameter Validation
SGLang validates parameters and raises errors for invalid values:
try:
    engine.generate(
        prompt="test",
        sampling_params={"temperature": -1.0}  # Invalid
    )
except ValueError as e:
    print(f"Invalid parameter: {e}")
Validation Rules:
- temperature >= 0.0
- 0.0 < top_p <= 1.0
- 0.0 <= min_p <= 1.0
- top_k >= 1 or top_k == -1
- -2.0 <= frequency_penalty <= 2.0
- -2.0 <= presence_penalty <= 2.0
- 0.0 <= repetition_penalty <= 2.0
- 0 <= min_new_tokens <= max_new_tokens
- Only one of json_schema, regex, ebnf can be set
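The rules above can also be checked on the client before a request is sent. The following sketch restates them in plain Python; `validate_sampling_params` is a hypothetical helper, and the fallback values used for missing keys are assumptions chosen so that absent parameters never trigger an error.

```python
def validate_sampling_params(params):
    """Return a list of rule violations (empty if params look valid)."""
    errors = []
    if params.get("temperature", 1.0) < 0.0:
        errors.append("temperature must be >= 0.0")
    if not 0.0 < params.get("top_p", 1.0) <= 1.0:
        errors.append("top_p must be in (0.0, 1.0]")
    if not 0.0 <= params.get("min_p", 0.0) <= 1.0:
        errors.append("min_p must be in [0.0, 1.0]")
    top_k = params.get("top_k", -1)
    if top_k != -1 and top_k < 1:
        errors.append("top_k must be -1 or at least 1")
    for name in ("frequency_penalty", "presence_penalty"):
        if not -2.0 <= params.get(name, 0.0) <= 2.0:
            errors.append(f"{name} must be in [-2.0, 2.0]")
    if not 0.0 <= params.get("repetition_penalty", 1.0) <= 2.0:
        errors.append("repetition_penalty must be in [0.0, 2.0]")
    min_new = params.get("min_new_tokens", 0)
    max_new = params.get("max_new_tokens")  # skipped when not provided
    if min_new < 0 or (max_new is not None and min_new > max_new):
        errors.append("need 0 <= min_new_tokens <= max_new_tokens")
    constraints = [key for key in ("json_schema", "regex", "ebnf")
                   if params.get(key) is not None]
    if len(constraints) > 1:
        errors.append("only one of json_schema, regex, ebnf can be set")
    return errors


print(validate_sampling_params({"temperature": 0.8, "top_p": 0.95, "top_k": 50}))
print(validate_sampling_params({"temperature": -1.0, "top_k": 0}))
```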
See Also