Adaptive Token Limit Management #3843

sarojrout · 2025-12-05T00:13:45Z

sarojrout
Dec 5, 2025

Is your feature request related to a problem? Please describe.
Yes. There are several related problems:

Yes. The current max_tokens is hardcoded to 8192 (e.g., in Claude model at src/google/adk/models/anthropic_llm.py:298). This causes two critical issues:

Truncated Responses: When responses exceed 8192 tokens, they get cut off mid-sentence, leaving users with incomplete answers
Token Waste: When responses are shorter, we waste allocated tokens and increase costs unnecessarily
No Adaptation: There's no way to dynamically adjust based on response quality, context window availability, or user needs
Manual Workarounds Required: Developers must manually count tokens and switch prompt engineering techniques, which is error-prone and doesn't scale

Real-World Impact

Personal Experience: While building an agentic system that queries databases to provide recommendations, I encountered this issue frequently. The system would retrieve data and generate responses that sometimes exceeded 8192 tokens, causing critical information to be truncated.

The Manual Workaround I Had to Implement:

Manually count tokens before sending requests
Switch between different prompt engineering techniques based on estimated token count
Use shorter, more concise prompts when approaching limits
Sometimes split responses across multiple invocations
Monitor token usage and adjust prompts reactively

This same issue also occurred when using OpenAI models in other projects - responses would sometimes generate more tokens than expected, requiring similar manual intervention.

The Problem This Creates:

Inconsistent Quality: Different prompt styles based on token count lead to inconsistent user experience
Development Overhead: Constant monitoring and manual adjustment is time-consuming
Production Risk: Truncated responses in production can lead to incomplete or incorrect information being delivered to users
Cost Inefficiency: Fixed limits waste tokens when responses are shorter, and fail when they're longer

Example Problem:

agent = Agent(
    name="code_generator",
    model=Claude(model="claude-3-5-sonnet-v2@20241022"),  # max_tokens=8192 fixed
    instruction="Generate complete code solutions"
)
# If code generation needs 10000 tokens, it gets truncated at 8192

Describe the solution you'd like
I propose adding several complementary features:

Add an Adaptive Token Limit Management feature that automatically adjusts max_tokens based on response quality and context.

Proposed API:

class AdaptiveTokenConfig(BaseModel):
    """Adaptive token limit configuration."""
    enabled: bool = False
    min_tokens: int = 512
    max_tokens: int = 8192
    initial_tokens: int = 2048
    increase_factor: float = 1.5  # Multiply if response truncated
    decrease_factor: float = 0.8  # Reduce if response much shorter
    truncation_detection: bool = True  # Auto-detect incomplete responses

Usage:

from google.adk import Agent, App
from google.adk.models.anthropic_llm import Claude
from google.adk.agents.run_config import RunConfig, AdaptiveTokenConfig

agent = Agent(
    name="code_generator",
    model=Claude(model="claude-3-5-sonnet-v2@20241022"),
    instruction="Generate complete code solutions"
)

app = App(
    name="my_app",
    root_agent=agent
)

# Use with adaptive tokens
run_config = RunConfig(
    adaptive_token_config=AdaptiveTokenConfig(
        enabled=True,
        initial_tokens=2048,
        max_tokens=16384  # Allow up to 16k if needed
    )
)

Behavior:

Start with initial_tokens (e.g., 2048) for first invocation
Detect truncation: Check if response ends mid-sentence, has incomplete function calls, or hits MAX_TOKENS finish reason
Auto-increase: If truncated, multiply by increase_factor (e.g., 2048 → 3072 → 4608 → ...)
Auto-decrease: If response uses < 50% of allocated tokens, reduce by decrease_factor for next invocation
Respect bounds: Never go below min_tokens or above max_tokens
Session-aware: Track across invocations in the same session

Describe alternatives you've considered

Manual Adjustment: Users could manually set max_tokens per invocation, but:
- Requires deep knowledge of token usage patterns
- Doesn't adapt to changing needs
- Adds cognitive overhead
Fixed Higher Limits: Simply increasing default to 16384:
- Wastes tokens when not needed
- Doesn't solve truncation for very long responses
- Increases costs unnecessarily
External Tracking: Building token management outside ADK:
- Loses integration with execution flow
- Harder to enforce at the right points
- Duplicates functionality

Additional context

Use Cases:

Long-form content generation: Articles, reports that may exceed 8192 tokens
Code generation: Complete functions/systems that need full responses
Multi-step reasoning: Explanations that require full context
Cost optimization: Reduce token waste in production deployments

Implementation Notes:

Can be implemented as an extension to RunConfig
Backward compatible (opt-in via enabled=True)
Should integrate with existing ContextCacheConfig for cost savings
Can track truncation via FinishReason.MAX_TOKENS in response

Related Code:

Current implementation: src/google/adk/models/anthropic_llm.py:298 (max_tokens: int = 8192)
RunConfig: src/google/adk/agents/run_config.py

Priority:

High - This directly addresses a critical production issue that forces developers to implement manual workarounds. The problem affects:

Production Reliability: Truncated responses can deliver incomplete or incorrect information to end users
Developer Productivity: Manual token counting and prompt switching is time-consuming and error-prone
Cost Optimization: Fixed limits waste tokens when responses are shorter, and fail when they're longer
Scalability: Manual interventions don't scale as systems grow in complexity

This feature would eliminate the need for manual token management and prompt engineering workarounds, allowing developers to focus on building better agentic systems rather than managing token limits.

ankursharmas · 2025-12-05T19:44:19Z

ankursharmas
Dec 5, 2025
Maintainer

You can override the value of max_tokens to whatever you want. 8192 is just a default value.

model=Claude(model="claude-3-5-sonnet-v2@20241022", max_tokens=)

1 reply

sarojrout Dec 5, 2025
Author

Hello @ankursharmas , Thank you for the response. However, the ability to set a fixed max_tokens value doesn't address the core feature request for automatic adaptation based on response quality and usage patterns.

What you can do now:

# Set a fixed max_tokens value
model = Claude(model="claude-3-5-sonnet-v2@20241022", max_tokens=16384)

Yes, it allows setting a custom value
But it's Fixed - doesn't adapt
It's a Manual process - you must choose the value
No learning - doesn't improve over time

What I requested to tackle this issue is: Adaptive Token Management

# Automatic adaptation based on usage
run_config = RunConfig(
    adaptive_token_config=AdaptiveTokenConfig(
        enabled=True,
        initial_tokens=2048,  # Start low
        max_tokens=16384      # Allow up to 16k if needed
    )
)

Automatic - adapts based on truncation detection
Dynamic - increases if truncated, decreases if underutilized
Session-aware - learns from previous invocations
Cost-optimized - starts low, increases only when needed

Why Fixed Values Don't Solve This

Problem 1: You Don't Know What Value to Set

# What should I set?
model = Claude(model="claude-3-5-sonnet-v2@20241022", max_tokens=???)
# - 8192? Might truncate
# - 16384? Wastes tokens when responses are shorter
# - 12000? How do I know?

As I explicitly stated in the issue based on my experience:

"Manual Workarounds Required: Developers must manually detect truncated responses, retry with higher max_tokens values, and continuously monitor and adjust token limits based on response patterns — a process that doesn't scale in production systems."

Setting a fixed value still requires manual work - you must:

Guess the right value
Monitor for truncation
Manually adjust if needed
Handle different scenarios differently

Problem 2: No Automatic Truncation Detection

Even with a fixed max_tokens=16384, you still need to:

Manually check FinishReason.MAX_TOKENS
Manually detect incomplete responses
Manually decide to increase the value
Manually track which invocations were truncated

Problem 3: No Cost Optimization

# Scenario 1: Response uses 2000 tokens
# You set max_tokens=16384
# Result: 14384 tokens wasted = wasted cost

# Scenario 2: Response needs 20000 tokens  
# You set max_tokens=16384
# Result: Still truncated, incomplete response

Fixed values can't optimize - they're either too high (waste) or too low (truncate).

Problem 4: No Adaptation

The value you set doesn't learn or adapt:

If responses are consistently shorter → still wastes tokens
If responses are consistently longer → still truncates
Different queries need different amounts → can't adapt per-query

Real-World Example

Current approach (fixed value):

model = Claude(model="claude-3-5-sonnet-v2@20241022", max_tokens=16384)

# Query 1: Simple question → Uses 500 tokens → 15884 wasted
# Query 2: Complex analysis → Needs 20000 tokens → Truncated at 16384
# Query 3: Medium question → Uses 3000 tokens → 13384 wasted

Adaptive approach (which i requested):

run_config = RunConfig(
    adaptive_token_config=AdaptiveTokenConfig(enabled=True, initial_tokens=2048)
)

# Query 1: Simple question → Uses 500 tokens → Next time: 1600 tokens (learned)
# Query 2: Complex analysis → Truncated at 2048 → Auto-increase to 3072 → Still truncated → Auto-increase to 4608 → Success
# Query 3: Medium question → Uses 3000 tokens → Next time: 2400 tokens (learned)

The Core Issue

The issue isn't about setting a value - it's about adapting the value automatically based on:

Truncation detection
Actual token usage
Response quality
Session history

Conclusion

While setting a fixed max_tokens value is useful, it doesn't address the need for automatic adaptation that:

Eliminates manual token management
Optimizes costs automatically
Adapts to different query patterns
Learns from usage over time

The feature request is for intelligent, adaptive token management, not just the ability to set a fixed value. These are complementary but distinct capabilities.

Pls let me know if you would like me to reopen this issue and work on it or provide more details on the adaptive token management use case as i faced the issues in the past, so thought of a good feature to add.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Adaptive Token Limit Management #3843

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 1 comment 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Select a reply

Uh oh!

Adaptive Token Limit Management #3843

Uh oh!

Uh oh!

sarojrout Dec 5, 2025

Real-World Impact

Proposed API:

Usage:

Behavior:

Use Cases:

Implementation Notes:

Related Code:

Priority:

Replies: 1 comment · 1 reply

Uh oh!

ankursharmas Dec 5, 2025 Maintainer

Uh oh!

Uh oh!

sarojrout Dec 5, 2025 Author

What you can do now:

What I requested to tackle this issue is: Adaptive Token Management

Why Fixed Values Don't Solve This

Problem 1: You Don't Know What Value to Set

Problem 2: No Automatic Truncation Detection

Problem 3: No Cost Optimization

Problem 4: No Adaptation

Real-World Example

The Core Issue

Conclusion

sarojrout
Dec 5, 2025

Replies: 1 comment 1 reply

ankursharmas
Dec 5, 2025
Maintainer

sarojrout Dec 5, 2025
Author