Overview

Sampling parameters control how the model generates text. They affect randomness, diversity, length, and structure of the output.

Quick Reference

sampling_params = {
    "max_new_tokens": 256,
    "temperature": 0.8,
    "top_p": 0.95,
    "top_k": 50,
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0,
    "stop": ["\n\n", "END"]
}

Token Generation

max_new_tokens

max_new_tokens
int
default:"128"
Maximum number of tokens to generate.
# Short response
engine.generate(
    prompt="What is AI?",
    sampling_params={"max_new_tokens": 50}
)

# Long response
engine.generate(
    prompt="Write a detailed essay",
    sampling_params={"max_new_tokens": 2048}
)

min_new_tokens

min_new_tokens
int
default:"0"
Minimum number of tokens to generate before allowing stop sequences or EOS.
# Ensure at least 100 tokens are generated
engine.generate(
    prompt="Explain quantum physics",
    sampling_params={
        "min_new_tokens": 100,
        "max_new_tokens": 500
    }
)

ignore_eos

ignore_eos
bool
default:"false"
Continue generating after the EOS token is produced, until max_new_tokens or another stop condition is reached.
# Ignore end-of-sequence token
engine.generate(
    prompt="Count to 100",
    sampling_params={
        "max_new_tokens": 1000,
        "ignore_eos": True
    }
)
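
How max_new_tokens, min_new_tokens, and ignore_eos interact can be sketched with a toy decoding loop. This is an illustrative model, not SGLang's implementation: the `decode` function, the `EOS` marker, and the scripted token stream are all hypothetical, and a real engine would typically suppress EOS during sampling rather than skip it afterwards.

```python
EOS = "<eos>"  # hypothetical end-of-sequence marker for this toy example


def decode(token_stream, max_new_tokens, min_new_tokens=0, ignore_eos=False):
    """Consume a scripted token stream under the three length controls."""
    output = []
    for token in token_stream:
        if len(output) >= max_new_tokens:
            break  # hard upper bound on generated tokens
        if token == EOS and not ignore_eos:
            if len(output) >= min_new_tokens:
                break  # EOS is honored once the minimum is met
            continue  # toy stand-in for "EOS not allowed yet"
        output.append(token)  # with ignore_eos, EOS is kept like any token
    return output


stream = ["a", "b", EOS, "c", "d", EOS, "e"]
print(decode(stream, max_new_tokens=10))
print(decode(stream, max_new_tokens=10, min_new_tokens=3))
print(decode(stream, max_new_tokens=10, ignore_eos=True))
```

In this sketch the first call stops at the first EOS, the second continues past it because fewer than three tokens had been produced, and the third runs until the stream is exhausted.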

Randomness Control

temperature

temperature
float
default:"1.0"
Controls randomness. Lower values (0.0-0.5) make output more focused and deterministic; higher values (0.8-2.0) make it more creative and diverse. Setting temperature to 0.0 enables greedy decoding (the most likely token is always picked).
# Factual, deterministic output
engine.generate(
    prompt="What is the capital of France?",
    sampling_params={"temperature": 0.0}
)

# Creative, varied output
engine.generate(
    prompt="Write a creative story",
    sampling_params={"temperature": 0.9}
)

# Very random (can be incoherent)
engine.generate(
    prompt="Generate random text",
    sampling_params={"temperature": 1.5}
)
Best Practices:
  • 0.0: Math, factual QA, code generation
  • 0.3-0.5: General assistant, summaries
  • 0.7-0.9: Creative writing, brainstorming
  • 1.0+: Experimental, high diversity needed
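
The effect of temperature on the token distribution can be illustrated with a small standard-library sketch. It assumes the usual formulation in which logits are divided by the temperature before the softmax; the function name and the three example logits are hypothetical, and this is not SGLang's sampler.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    if temperature <= 0.0:
        # Greedy decoding: all probability mass goes to the highest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
for t in (0.0, 0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top token, while higher temperatures flatten the distribution.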

top_p (Nucleus Sampling)

top_p
float
default:"1.0"
Cumulative probability threshold for nucleus sampling. Only the most likely tokens whose cumulative probability stays within top_p are considered. Range: (0.0, 1.0]. Lower values (0.1-0.5) produce more focused output; higher values (0.9-1.0) allow more diversity.
# Very focused
engine.generate(
    prompt="Summarize this article",
    sampling_params={"top_p": 0.1, "temperature": 0.7}
)

# Balanced
engine.generate(
    prompt="Write a paragraph",
    sampling_params={"top_p": 0.9, "temperature": 0.8}
)

# Maximum diversity
engine.generate(
    prompt="Brainstorm ideas",
    sampling_params={"top_p": 1.0, "temperature": 1.0}
)
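
Nucleus filtering itself can be sketched in a few lines. This is a simplified illustration under one common convention (keep the smallest set of top tokens whose mass reaches top_p, then renormalize); implementations differ in how they treat the boundary token, and `top_p_filter` and the example distribution are hypothetical.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    # Renormalize so the surviving probabilities sum to 1.
    return {i: probs[i] / total for i in kept}


probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token distribution
for p in (0.1, 0.9, 1.0):
    print(p, sorted(top_p_filter(probs, p)))
```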

top_k

top_k
int
default:"-1"
Only sample from the top K most likely tokens. Set to -1 to disable (consider all tokens).
# Very constrained
engine.generate(
    prompt="Complete: The capital of France is",
    sampling_params={"top_k": 5, "temperature": 0.7}
)

# More options
engine.generate(
    prompt="Write creatively",
    sampling_params={"top_k": 50, "temperature": 0.8}
)

# Unconstrained (default)
engine.generate(
    prompt="Generate text",
    sampling_params={"top_k": -1, "temperature": 0.8}
)
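
For comparison, a top-k filter keeps a fixed number of candidates regardless of how probability mass is distributed. The sketch below is illustrative only; `top_k_filter` and the example distribution are hypothetical names, and -1 is treated as "disabled" to match the description above.

```python
def top_k_filter(probs, top_k):
    """Keep only the top_k most likely tokens; -1 keeps every token."""
    if top_k == -1:
        return dict(enumerate(probs))
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = order[:top_k]
    total = sum(probs[i] for i in kept)
    # Renormalize so the surviving probabilities sum to 1.
    return {i: probs[i] / total for i in kept}


probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token distribution
print(top_k_filter(probs, 2))
print(len(top_k_filter(probs, -1)))
```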

min_p

min_p
float
default:"0.0"
Minimum probability threshold, scaled by the probability of the most likely token. Tokens with probability below min_p times that top probability are filtered out. Range: [0.0, 1.0]
# Filter out low-probability tokens
engine.generate(
    prompt="Write text",
    sampling_params={
        "min_p": 0.05,  # Drop tokens with p < 5% of the top token's probability
        "temperature": 0.8
    }
)
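
A minimal sketch of min-p filtering follows. It assumes the common formulation in which the cutoff is min_p multiplied by the probability of the most likely token, so the threshold adapts to how confident the model is; `min_p_filter` and the example distribution are hypothetical.

```python
def min_p_filter(probs, min_p):
    """Drop tokens whose probability is below min_p * (top probability)."""
    threshold = min_p * max(probs)  # assumption: cutoff scales with the top token
    kept = {i: p for i, p in enumerate(probs) if p >= threshold}
    total = sum(kept.values())
    # Renormalize so the surviving probabilities sum to 1.
    return {i: p / total for i, p in kept.items()}


probs = [0.6, 0.3, 0.08, 0.02]  # hypothetical next-token distribution
print(sorted(min_p_filter(probs, 0.05)))
print(sorted(min_p_filter(probs, 0.0)))
```

With these numbers the cutoff for min_p = 0.05 is 0.03, so only the last token is removed.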

Repetition Control

frequency_penalty

frequency_penalty
float
default:"0.0"
Penalize tokens in proportion to how often they have appeared in the generated text. Range: [-2.0, 2.0]. Positive values discourage repetition; negative values encourage it.
# Reduce repetition
engine.generate(
    prompt="Write a diverse essay",
    sampling_params={
        "frequency_penalty": 0.5,
        "max_new_tokens": 500
    }
)

# Strong anti-repetition
engine.generate(
    prompt="List unique ideas",
    sampling_params={
        "frequency_penalty": 1.0,
        "max_new_tokens": 200
    }
)

presence_penalty

presence_penalty
float
default:"0.0"
Penalize tokens that have already appeared, regardless of how often. Range: [-2.0, 2.0]. Positive values encourage new topics; negative values keep the output on topic.
# Encourage topic diversity
engine.generate(
    prompt="Brainstorm topics",
    sampling_params={
        "presence_penalty": 0.6,
        "max_new_tokens": 300
    }
)

repetition_penalty

repetition_penalty
float
default:"1.0"
Apply a multiplicative penalty to tokens that have already been generated. Range: [0.0, 2.0]. Values > 1.0 discourage repetition, 1.0 applies no penalty (default), and values < 1.0 encourage repetition.
# Reduce repetition (alternative to frequency_penalty)
engine.generate(
    prompt="Write text",
    sampling_params={
        "repetition_penalty": 1.2,
        "max_new_tokens": 200
    }
)
Penalty Comparison:
  • frequency_penalty: Linear scaling based on token frequency
  • presence_penalty: Binary (appeared or not)
  • repetition_penalty: Multiplicative penalty on logits
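
The three penalty styles in the comparison above can be sketched as operations on logits. This follows the widely used formulations (subtract count times frequency_penalty, subtract presence_penalty once, and divide positive or multiply negative logits by repetition_penalty); it is an assumption that SGLang matches them exactly, and `apply_penalties` is a hypothetical helper that expects repetition_penalty > 0.

```python
from collections import Counter


def apply_penalties(logits, generated_ids, frequency_penalty=0.0,
                    presence_penalty=0.0, repetition_penalty=1.0):
    """Return a penalized copy of logits for tokens already generated."""
    counts = Counter(generated_ids)
    out = list(logits)
    for token_id, count in counts.items():
        # Multiplicative: shrink positive logits, push negative ones lower.
        if out[token_id] > 0:
            out[token_id] /= repetition_penalty
        else:
            out[token_id] *= repetition_penalty
        out[token_id] -= frequency_penalty * count  # linear in the count
        out[token_id] -= presence_penalty           # flat, applied once
    return out


logits = [2.0, 1.0, -1.0, 0.5]  # hypothetical scores for a 4-token vocabulary
generated = [0, 0, 2]           # token 0 appeared twice, token 2 once
print(apply_penalties(logits, generated, frequency_penalty=0.5))
print(apply_penalties(logits, generated, presence_penalty=0.6))
print(apply_penalties(logits, generated, repetition_penalty=1.2))
```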

Stop Conditions

stop

stop
string | array
default:"null"
Stop generation when any of these strings are generated.
# Single stop string
engine.generate(
    prompt="List items:\n1.",
    sampling_params={
        "stop": "\n\n",
        "max_new_tokens": 200
    }
)

# Multiple stop strings
engine.generate(
    prompt="Write code",
    sampling_params={
        "stop": ["```", "\n\nEND", "<|endoftext|>"],
        "max_new_tokens": 500
    }
)

stop_token_ids

stop_token_ids
array[int]
default:"null"
Stop generation when any of these token IDs are generated.
# Stop on specific token IDs
engine.generate(
    prompt="Generate text",
    sampling_params={
        "stop_token_ids": [128001, 128009],  # Model-specific IDs
        "max_new_tokens": 200
    }
)

stop_regex

stop_regex
string | array
default:"null"
Stop generation when output matches any of these regex patterns.
# Stop when a number pattern appears
engine.generate(
    prompt="Count:",
    sampling_params={
        "stop_regex": r"\d{3}",  # Stop at 3-digit number
        "max_new_tokens": 100
    }
)

# Multiple regex patterns
engine.generate(
    prompt="Write text",
    sampling_params={
        "stop_regex": [r"\bEND\b", r"\d{4}-\d{2}-\d{2}"],
        "max_new_tokens": 300
    }
)
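
Conceptually, a regex stop condition is checked against the text accumulated so far. The sketch below simulates this with Python's re module over hypothetical streamed chunks; `first_regex_stop` is an illustrative helper, and trimming the matched text from the result is an assumption rather than documented SGLang behavior.

```python
import re


def first_regex_stop(chunks, stop_regex):
    """Accumulate chunks and cut the text at the earliest regex match."""
    patterns = [stop_regex] if isinstance(stop_regex, str) else list(stop_regex)
    compiled = [re.compile(p) for p in patterns]
    text = ""
    for chunk in chunks:
        text += chunk
        matches = [m for m in (c.search(text) for c in compiled) if m]
        if matches:
            first = min(matches, key=lambda m: m.start())
            return text[:first.start()]  # assumption: the match is trimmed
    return text  # no pattern matched; all text is kept


chunks = ["1 ", "22 ", "33", "3 4444"]  # hypothetical streamed pieces
print(repr(first_regex_stop(chunks, r"\d{3}")))
```

Note that the match can span a chunk boundary, which is why the pattern is applied to the accumulated text rather than to each chunk.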

no_stop_trim

no_stop_trim
bool
default:"false"
If true, don’t remove the stop string from the output.
# Include stop string in output
engine.generate(
    prompt="Count to 3:",
    sampling_params={
        "stop": "\n\n",
        "no_stop_trim": True,
        "max_new_tokens": 50
    }
)
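
The difference between trimming and keeping the stop string can be shown on plain strings. This is a toy post-processing sketch, not how SGLang applies stop strings internally; `apply_stop` is a hypothetical helper.

```python
def apply_stop(text, stop, no_stop_trim=False):
    """Cut text at the earliest stop string, optionally keeping that string."""
    stops = [stop] if isinstance(stop, str) else list(stop)
    hits = [(text.find(s), s) for s in stops if s in text]
    if not hits:
        return text  # no stop string present
    index, matched = min(hits)  # earliest occurrence wins
    end = index + len(matched) if no_stop_trim else index
    return text[:end]


raw = "1\n2\n3\n\n4"
print(repr(apply_stop(raw, "\n\n")))
print(repr(apply_stop(raw, "\n\n", no_stop_trim=True)))
```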

Structured Output

json_schema

json_schema
string
default:"null"
JSON schema to constrain output. Ensures generated text is valid JSON matching the schema.
import json

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
        "email": {"type": "string", "format": "email"},
        "hobbies": {
            "type": "array",
            "items": {"type": "string"}
        }
    },
    "required": ["name", "age"]
}

response = engine.generate(
    prompt="Generate a person profile",
    sampling_params={
        "json_schema": json.dumps(schema),
        "max_new_tokens": 200
    }
)

data = json.loads(response["text"])
print(f"Name: {data['name']}, Age: {data['age']}")
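
When parsing constrained output, it can still be worth guarding json.loads, since generation that reaches max_new_tokens before the object is closed may leave truncated text. The helper below is a hypothetical sketch using only the standard library; the two sample strings are made up for illustration.

```python
import json


def parse_json_output(text):
    """Return the parsed object, or None if text is not complete, valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


complete = '{"name": "Ada", "age": 36}'
truncated = '{"name": "Ada", "ag'  # what a too-small token budget could leave

print(parse_json_output(complete))
print(parse_json_output(truncated))
```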

regex

regex
string
default:"null"
Regular expression pattern to constrain output format.
# Phone number format
engine.generate(
    prompt="Generate a phone number:",
    sampling_params={
        "regex": r"\d{3}-\d{3}-\d{4}",
        "max_new_tokens": 20
    }
)
# Output: "555-123-4567"

# Email format
engine.generate(
    prompt="Generate an email:",
    sampling_params={
        "regex": r"[a-z]+@[a-z]+\.[a-z]+",
        "max_new_tokens": 30
    }
)
# Output: "user@example.com"

# Date format
engine.generate(
    prompt="Today's date:",
    sampling_params={
        "regex": r"\d{4}-\d{2}-\d{2}",
        "max_new_tokens": 15
    }
)
# Output: "2024-03-15"
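
The patterns above constrain format, not meaning. A quick check with Python's re module (a sketch; `matches_constraint` and the sample strings are hypothetical) shows that the date pattern accepts any digits in the right shape, including impossible dates.

```python
import re

PATTERNS = {
    "phone": r"\d{3}-\d{3}-\d{4}",
    "email": r"[a-z]+@[a-z]+\.[a-z]+",
    "date": r"\d{4}-\d{2}-\d{2}",
}


def matches_constraint(text, pattern):
    """True if the whole text matches the pattern, as a constraint requires."""
    return re.fullmatch(pattern, text) is not None


print(matches_constraint("555-123-4567", PATTERNS["phone"]))
print(matches_constraint("2024-13-45", PATTERNS["date"]))  # shape only
print(matches_constraint("User@Example.com", PATTERNS["email"]))
```

If semantic validity matters (real dates, deliverable addresses), validate the generated value separately after generation.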

ebnf

ebnf
string
default:"null"
EBNF (Extended Backus-Naur Form) grammar to constrain output.
# Mathematical expression grammar
grammar = """
root ::= expression
expression ::= term (("+" | "-") term)*
term ::= factor (("*" | "/") factor)*
factor ::= number | "(" expression ")"
number ::= [0-9]+
"""

engine.generate(
    prompt="Generate a math expression:",
    sampling_params={
        "ebnf": grammar,
        "max_new_tokens": 50
    }
)
# Example output: "(5+3)*2-7" (this grammar does not allow whitespace)

# SQL query grammar
sql_grammar = """
root ::= select_stmt
select_stmt ::= "SELECT " column_list " FROM " table_name where_clause?
column_list ::= "*" | column_name (", " column_name)*
where_clause ::= " WHERE " condition
condition ::= column_name " = " value
column_name ::= [a-zA-Z_]+
table_name ::= [a-zA-Z_]+
value ::= [0-9]+ | "'" [a-zA-Z ]+ "'"
"""

engine.generate(
    prompt="Generate a SQL query:",
    sampling_params={
        "ebnf": sql_grammar,
        "max_new_tokens": 100
    }
)
# Output: "SELECT * FROM users WHERE id = 5"

Advanced Parameters

n (Number of Completions)

n
int
default:"1"
Generate N independent completions for each prompt.
# Generate 3 different responses
response = engine.generate(
    prompt="Write a creative opening sentence",
    sampling_params={
        "n": 3,
        "temperature": 0.9,
        "max_new_tokens": 50
    }
)

# With n > 1, the result is a list with one entry per completion
for i, output in enumerate(response):
    print(f"Option {i+1}: {output['text']}")

logit_bias

logit_bias
dict
default:"null"
Modify the likelihood of specific tokens. Keys are token IDs, values are bias adjustments. Range: Typically [-100, 100]
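# Discourage a specific token (a word may span several tokens; only the first is biased here)
tokenizer = engine.tokenizer_manager.tokenizer
# Skip special tokens so index 0 is the word itself rather than a BOS token
token_id = tokenizer.encode("bad", add_special_tokens=False)[0]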

engine.generate(
    prompt="Write a review",
    sampling_params={
        "logit_bias": {str(token_id): -10.0},
        "max_new_tokens": 100
    }
)
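
What a bias does to the logits can be sketched directly. The helper below assumes the bias is simply added to the selected logits before sampling, which is the usual meaning of logit bias; `apply_logit_bias` and the example logits are hypothetical.

```python
def apply_logit_bias(logits, logit_bias):
    """Add each bias to the logit of its token ID (keys may be strings)."""
    out = list(logits)
    for token_id, bias in logit_bias.items():
        out[int(token_id)] += bias
    return out


logits = [1.0, 2.0, 3.0]  # hypothetical scores for token IDs 0, 1, 2
print(apply_logit_bias(logits, {"2": -10.0}))
```

A large negative bias makes a token very unlikely without removing it outright, while a large positive bias makes it nearly certain.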

sampling_seed

sampling_seed
int
default:"null"
Random seed for reproducible sampling. Set this for deterministic outputs.
# Reproducible generation
for i in range(3):
    response = engine.generate(
        prompt="Generate a random number",
        sampling_params={
            "sampling_seed": 42,
            "temperature": 1.0,
            "max_new_tokens": 10
        }
    )
    print(response["text"])  # Same output each time
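
The idea behind seeded sampling can be illustrated with Python's random module. This is a stand-in for the concept only, not SGLang's sampler; `sample_tokens` and the example distribution are hypothetical.

```python
import random


def sample_tokens(probs, count, seed=None):
    """Draw token IDs from probs; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)  # private generator, independent of global state
    return rng.choices(range(len(probs)), weights=probs, k=count)


probs = [0.5, 0.3, 0.2]  # hypothetical next-token distribution
first = sample_tokens(probs, 10, seed=42)
second = sample_tokens(probs, 10, seed=42)
print(first == second)
```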

skip_special_tokens

skip_special_tokens
bool
default:"true"
Remove special tokens (BOS, EOS, PAD) from decoded output.
# Include special tokens in output
engine.generate(
    prompt="Hello",
    sampling_params={
        "skip_special_tokens": False,
        "max_new_tokens": 20
    }
)

spaces_between_special_tokens

spaces_between_special_tokens
bool
default:"true"
Add spaces between special tokens when decoding.

Parameter Combinations

Creative Writing

sampling_params = {
    "max_new_tokens": 500,
    "temperature": 0.9,
    "top_p": 0.95,
    "frequency_penalty": 0.3,
    "presence_penalty": 0.3
}

Code Generation

sampling_params = {
    "max_new_tokens": 512,
    "temperature": 0.2,
    "top_p": 0.95,
    "stop": ["\n\n", "```"],
    "repetition_penalty": 1.1
}

Factual Q&A

sampling_params = {
    "max_new_tokens": 150,
    "temperature": 0.0,  # Greedy
    "top_k": 1
}

JSON Generation

sampling_params = {
    "max_new_tokens": 300,
    "temperature": 0.3,
    "json_schema": json.dumps(schema)  # schema dict as defined in the json_schema section; requires import json
}

Diverse Brainstorming

sampling_params = {
    "max_new_tokens": 200,
    "temperature": 1.2,
    "top_p": 0.98,
    "presence_penalty": 0.8,
    "n": 5  # Generate 5 ideas
}

Parameter Validation

SGLang validates parameters and raises errors for invalid values:
try:
    engine.generate(
        prompt="test",
        sampling_params={"temperature": -1.0}  # Invalid
    )
except ValueError as e:
    print(f"Invalid parameter: {e}")
Validation Rules:
  • temperature >= 0.0
  • 0.0 < top_p <= 1.0
  • 0.0 <= min_p <= 1.0
  • top_k >= 1 or top_k == -1
  • -2.0 <= frequency_penalty <= 2.0
  • -2.0 <= presence_penalty <= 2.0
  • 0.0 <= repetition_penalty <= 2.0
  • 0 <= min_new_tokens <= max_new_tokens
  • Only one of json_schema, regex, ebnf can be set
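
The rules above can be mirrored in a client-side pre-check, which is useful for catching mistakes before a request is sent. This is a sketch of the listed rules only: `validate_sampling_params` is a hypothetical helper, not part of SGLang, and the fallback defaults are taken from this page.

```python
def validate_sampling_params(params):
    """Return a list of rule violations; an empty list means the dict passes."""
    errors = []
    get = params.get
    if get("temperature", 1.0) < 0.0:
        errors.append("temperature must be >= 0.0")
    if not 0.0 < get("top_p", 1.0) <= 1.0:
        errors.append("top_p must be in (0.0, 1.0]")
    if not 0.0 <= get("min_p", 0.0) <= 1.0:
        errors.append("min_p must be in [0.0, 1.0]")
    top_k = get("top_k", -1)
    if top_k != -1 and top_k < 1:
        errors.append("top_k must be >= 1 or -1")
    for name in ("frequency_penalty", "presence_penalty"):
        if not -2.0 <= get(name, 0.0) <= 2.0:
            errors.append(f"{name} must be in [-2.0, 2.0]")
    if not 0.0 <= get("repetition_penalty", 1.0) <= 2.0:
        errors.append("repetition_penalty must be in [0.0, 2.0]")
    if not 0 <= get("min_new_tokens", 0) <= get("max_new_tokens", 128):
        errors.append("need 0 <= min_new_tokens <= max_new_tokens")
    constraints = [k for k in ("json_schema", "regex", "ebnf") if get(k) is not None]
    if len(constraints) > 1:
        errors.append("only one of json_schema, regex, ebnf can be set")
    return errors


print(validate_sampling_params({"temperature": 0.8, "top_p": 0.95, "top_k": 50}))
print(validate_sampling_params({"temperature": -1.0, "top_k": 0}))
```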
