Prompt Optimization
Testing and improving prompts
A good prompt gets the job done. An optimized prompt gets the job done efficiently—faster, cheaper, more consistently. This chapter teaches you how to systematically improve prompts across multiple dimensions.
Want to optimize your prompts automatically? Use our Prompt Enhancer tool. It analyzes your prompt, applies optimization techniques, and shows you similar community prompts for inspiration.
The Optimization Trade-offs
Every optimization involves trade-offs. Understanding these helps you make intentional choices:
- Higher quality often requires more tokens or better models (e.g., adding examples improves accuracy but increases token count)
- Faster models may sacrifice some capability (e.g., GPT-4 is smarter but slower than GPT-4o-mini)
- Lower temperature = more predictable but less creative (e.g., temperature 0.2 for facts, 0.8 for brainstorming)
- Edge case handling adds complexity (simple prompts fail on unusual inputs)
Measuring What Matters
Before optimizing, define success. What does "better" mean for your use case?
- **Accuracy**: How often is the output correct? (e.g., 90% of code suggestions compile without errors)
- **Relevance**: Does it address what was actually asked? (response directly answers the question vs. tangents)
- **Completeness**: Are all requirements covered? (all 5 requested sections included in output)
- **Latency**: How long until the response arrives? (p50 < 2s, p95 < 5s for chat applications)
- **Token efficiency**: How many tokens for the same result? (500 tokens vs. 1,500 tokens for equivalent output)
- **Consistency**: How similar are outputs for similar inputs? (same question gets structurally similar answers)
Percentile metrics show response time distribution. p50 (median) means 50% of requests are faster than this value. p95 means 95% are faster—it catches slow outliers. If your p50 is 1s but p95 is 10s, most users are happy but 5% experience frustrating delays.
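A minimal sketch of how you might compute these percentiles from recorded response times, using the nearest-rank method (plain TypeScript; the sample latencies are made up):

```ts
// Compute a nearest-rank percentile from latency samples (in ms).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const index = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, index)];
}

const latenciesMs = [820, 950, 1100, 1240, 1500, 2100, 4800, 9600]; // example data
console.log(`p50: ${percentile(latenciesMs, 50)}ms, p95: ${percentile(latenciesMs, 95)}ms`);
```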
Use this template to clarify what you're optimizing for before making changes.
Help me define success metrics for my prompt optimization.
**My use case**: ${useCase}
**Current pain points**: ${painPoints}
For this use case, help me define:
1. **Primary metric**: What single metric matters most?
2. **Secondary metrics**: What else should I track?
3. **Acceptable trade-offs**: What can I sacrifice for the primary metric?
4. **Red lines**: What quality level is unacceptable?
5. **How to measure**: Practical ways to evaluate each metric

Token Optimization
Tokens cost money and add latency. Here's how to say the same thing with fewer tokens.
The Compression Principle
Verbose (67 tokens)
I would like you to please help me with the following task. I need you to take the text that I'm going to provide below and create a summary of it. The summary should capture the main points and be concise. Please make sure to include all the important information. Here is the text: [text]
Concise (12 tokens)
Summarize this text, capturing main points concisely: [text]
Same result, 82% fewer tokens.
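If you want a quick sanity check without running a tokenizer, a common rough heuristic for English text is about 4 characters per token. A sketch (the ratio is an approximation, not an exact count):

```ts
// Very rough token estimate: ~4 characters per token for English text.
// For exact counts, use the actual tokenizer for your model.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

const verbose = "I would like you to please help me with the following task...";
const concise = "Summarize this text, capturing main points concisely:";
console.log(estimateTokens(verbose), estimateTokens(concise));
```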
Token-Saving Techniques
"Please" and "Thank you" add tokens without improving output
"Please summarize" → "Summarize"
Don't repeat yourself or state the obvious
"Write a summary that summarizes" → "Summarize"
Where meaning is clear, abbreviate
"for example" → "e.g."
Point to content instead of repeating it
"the text above" instead of re-quoting
Paste a verbose prompt to get a token-optimized version.
Compress this prompt while preserving its meaning and effectiveness:
Original prompt:
"${verbosePrompt}"
Instructions:
1. Remove unnecessary pleasantries and filler words
2. Eliminate redundancy
3. Use concise phrasing
4. Keep all essential instructions and constraints
5. Maintain clarity—don't sacrifice understanding for brevity
Provide:
- **Compressed version**: The optimized prompt
- **Token reduction**: Estimated percentage saved
- **What was cut**: Brief explanation of what was removed and why it was safe to remove

Quality Optimization
Sometimes you need better outputs, not cheaper ones. Here's how to improve quality.
Accuracy Boosters
- Ask the model to check its own work ("...then verify your answer is correct")
- Make uncertainty explicit ("Rate your confidence 1-10 and explain any uncertainty")
- Get different perspectives, then choose ("Provide 3 approaches and recommend the best one")
- Force step-by-step thinking ("Think step by step and show your reasoning")
Consistency Boosters
- Show exactly what output should look like (include a template or schema)
- Provide 2-3 examples of ideal output ("Here's what good looks like: [examples]")
- Reduce randomness for more predictable output (temperature 0.3-0.5 for consistent results)
- Add a validation step for critical fields ("Verify all required fields are present")
Add quality-improving elements to your prompt.
Enhance this prompt for higher quality outputs:
Original prompt:
"${originalPrompt}"
**What quality issue I'm seeing**: ${qualityIssue}
Add appropriate quality boosters:
1. If accuracy is the issue → add verification steps
2. If consistency is the issue → add format specifications or examples
3. If relevance is the issue → add context and constraints
4. If completeness is the issue → add explicit requirements
Provide the enhanced prompt with explanations for each addition.

Latency Optimization
When speed matters, every millisecond counts.
Model Selection by Speed Need
- Real-time: use the smallest effective model plus aggressive caching (GPT-4o-mini, Claude Haiku, cached responses)
- Interactive: fast models with streaming enabled (GPT-4o-mini with streaming)
- Standard: mid-tier models that balance quality and speed (GPT-4o, Claude Sonnet)
- Background: use the best model and process offline (GPT-4, Claude Opus for offline processing)
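A tier list like this often ends up as a tiny routing function. A sketch; the latency boundaries and model names are illustrative assumptions, not recommendations:

```ts
// Pick a model by latency budget (ms). Boundaries are placeholders:
// tune them against your own measurements.
function pickModel(latencyBudgetMs: number): string {
  if (latencyBudgetMs < 1000) return "gpt-4o-mini"; // real-time: smallest effective model
  if (latencyBudgetMs < 5000) return "gpt-4o"; // interactive/standard: mid-tier
  return "gpt-4"; // background jobs: best model
}
```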
Speed Techniques
- Fewer input tokens = faster processing (compress prompts, remove unnecessary context)
- Set max_tokens to prevent runaway responses (max_tokens: 500 for summaries)
- Get first tokens to the user faster for better UX (stream any response > 100 tokens)
- Don't recompute identical queries (cache common questions and template outputs)
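The caching technique can be as simple as a map keyed on a normalized prompt. A minimal in-memory sketch (no eviction or TTL; `complete` is your injected LLM client, an assumed `(prompt) => Promise<string>` function):

```ts
type CompleteFn = (prompt: string) => Promise<string>;

const cache = new Map<string, string>();

// Normalize so trivially different phrasings of the same query hit the cache.
function cacheKey(prompt: string): string {
  return prompt.trim().toLowerCase().replace(/\s+/g, " ");
}

async function completeCached(complete: CompleteFn, prompt: string): Promise<string> {
  const key = cacheKey(prompt);
  const hit = cache.get(key);
  if (hit !== undefined) return hit; // skip the API call entirely
  const result = await complete(prompt);
  cache.set(key, result);
  return result;
}
```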
Cost Optimization
At scale, small savings multiply into significant budget impact.
Understanding Costs
API pricing is per token, with separate rates for input and output tokens. To estimate spend, multiply your expected token volume by your model's per-token rates, as in the sketch below.
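A back-of-the-envelope estimate is just token counts times rates. A sketch; the model names and rates below are placeholders, so substitute current pricing for the models you actually use:

```ts
// Per-million-token rates (USD). PLACEHOLDER values: look up current pricing.
const RATES: Record<string, { inPerM: number; outPerM: number }> = {
  "cheap-model": { inPerM: 0.15, outPerM: 0.6 },
  "premium-model": { inPerM: 5.0, outPerM: 15.0 },
};

function estimateCostUSD(model: string, inputTokens: number, outputTokens: number): number {
  const r = RATES[model];
  return (inputTokens / 1e6) * r.inPerM + (outputTokens / 1e6) * r.outPerM;
}

// 1M requests/month at 450 input + 200 output tokens each:
console.log(estimateCostUSD("cheap-model", 450e6, 200e6).toFixed(2));
```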
Cost Reduction Strategies
- Use expensive models only when needed (simple questions → GPT-4o-mini; complex → GPT-4)
- Shorter prompts = lower cost per request (cut 50% of tokens = 50% input cost savings)
- Limit response length when full detail isn't needed ("Answer in 2-3 sentences" vs. unlimited)
- Combine related queries into single requests (analyze 10 items in one prompt vs. 10 separate calls)
- Don't send requests that don't need AI (keyword matching before expensive classification)
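The last two strategies combine naturally: answer trivial queries without the API at all, and route the rest by difficulty. A sketch with hypothetical heuristics and an injected `complete` client:

```ts
type CompleteFn = (prompt: string, model: string) => Promise<string>;

const FAQ: Record<string, string> = {
  "what are your hours": "We're open 9am-5pm, Monday through Friday.",
};

async function answer(complete: CompleteFn, query: string): Promise<string> {
  // Pre-filter: a keyword/FAQ match costs nothing.
  const canned = FAQ[query.trim().toLowerCase().replace(/[?.!]/g, "")];
  if (canned) return canned;

  // Model routing: a crude length heuristic stands in for a real classifier.
  const model = query.length < 120 ? "gpt-4o-mini" : "gpt-4";
  return complete(query, model);
}
```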
The Optimization Loop
Optimization is iterative. Here's a systematic process:
Step 1: Establish Baseline
You can't improve what you don't measure. Before changing anything, document your starting point rigorously.
- Save the exact prompt text, including system prompts and any templates (version control your prompts like code)
- Create 20-50 representative inputs that cover common cases and edge cases (include easy, medium, and hard examples)
- Score each output against your success criteria (accuracy %, relevance score, format compliance)
- Measure tokens and timing for each test case (avg input: 450 tokens, avg output: 200 tokens, p50 latency: 1.2s)
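A baseline run can be a small harness that replays your test cases and records the numbers. A sketch; `complete` and the prompt builder are stand-ins for your own client and template:

```ts
type CompleteFn = (prompt: string) => Promise<string>;

interface BaselineRow {
  input: string;
  output: string;
  latencyMs: number;
}

async function runBaseline(
  complete: CompleteFn,
  prompt: (input: string) => string,
  testCases: string[]
): Promise<BaselineRow[]> {
  const rows: BaselineRow[] = [];
  for (const input of testCases) {
    const start = Date.now();
    const output = await complete(prompt(input));
    rows.push({ input, output, latencyMs: Date.now() - start });
  }
  return rows; // score rows against your criteria, then persist alongside the prompt version
}
```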
Use this to create a comprehensive baseline before optimizing.
Create a baseline documentation for my prompt optimization project.
**Current prompt**:
"${currentPrompt}"
**What the prompt does**: ${promptPurpose}
**Current issues I'm seeing**: ${currentIssues}
Generate a baseline documentation template with:
1. **Prompt Snapshot**: The exact prompt text (for version control)
2. **Test Cases**: Suggest 10 representative test inputs I should use, covering:
- 3 typical/easy cases
- 4 medium complexity cases
- 3 edge cases or difficult inputs
3. **Metrics to Track**:
- Quality metrics specific to this use case
- Efficiency metrics (tokens, latency)
- How to score each metric
4. **Baseline Hypothesis**: What do I expect the current performance to be?
5. **Success Criteria**: What numbers would make me satisfied with optimization?

Step 2: Form a Hypothesis
Vague goal
I want to make my prompt better.
Testable hypothesis
If I add 2 few-shot examples, accuracy will improve from 75% to 85% because the model will learn the expected pattern.
Step 3: Test One Change
Change one thing at a time. Run both versions on the same test inputs. Measure the metrics that matter.
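Testing one change means running both prompt variants over the same inputs and comparing a single metric. A sketch with a caller-supplied scorer (returning 1 for pass, 0 for fail) and an injected `complete` client:

```ts
type CompleteFn = (prompt: string) => Promise<string>;

async function compareVariants(
  complete: CompleteFn,
  variantA: (input: string) => string,
  variantB: (input: string) => string,
  testCases: string[],
  score: (input: string, output: string) => number
): Promise<{ a: number; b: number }> {
  let a = 0, b = 0;
  for (const input of testCases) {
    a += score(input, await complete(variantA(input)));
    b += score(input, await complete(variantB(input)));
  }
  // Average score per variant over identical inputs.
  return { a: a / testCases.length, b: b / testCases.length };
}
```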
Step 4: Analyze and Decide
Did it work? Keep the change. Did it hurt? Revert. Was it neutral? Revert (simpler is better).
Step 5: Repeat
Generate new hypotheses based on what you learned. Keep iterating until you hit your targets or reach diminishing returns.
Optimization Checklist
Before you ship an optimization, run through the loop: define the metric that matters, capture a baseline, change one thing at a time, measure both versions on the same test cases, and revert anything neutral or harmful.

A quick self-test: you have a prompt that works well but costs too much at scale. What's the FIRST thing you should do? Establish a baseline. Measure current token usage, cost per request, and output quality before changing anything; without those numbers you can't tell whether a cheaper variant is actually equivalent.