Tool Error Handling and Retries
Tools fail. APIs time out, databases go offline, inputs are malformed. An agent that treats every ToolResult.error as fatal will halt at the first hiccup. Robust error handling means the agent retries transient failures, falls back to alternatives, and degrades gracefully when nothing works. What you'll build: retry wrappers, fallback tool chains, and a graceful-degradation agent that continues making progress even when individual tools fail. Time: ~15 min. Difficulty: Intermediate
Prerequisites
pip install synapsekit
export OPENAI_API_KEY="sk-..."
What you'll learn
- How
ToolResult.is_errorsignals failure to the agent - Building a
RetryToolwrapper with configurable backoff - Building a
FallbackToolthat tries alternatives in order - How the agent reasons about tool errors and whether to retry
- Logging errors without disrupting the agent loop
Step 1: Understand ToolResult error semantics
import asyncio
from synapsekit.agents import BaseTool, ToolResult
ToolResult.is_error is True when the error field is not None. The __str__ method returns error when set, so the agent always receives a string observation — it never sees None.
# Success
ok = ToolResult(output="42")
print(ok.is_error) # False
print(str(ok)) # "42"
# Error
err = ToolResult(output="", error="Connection timeout after 5s")
print(err.is_error) # True
print(str(err)) # "Connection timeout after 5s"
The agent receives the str(result) as its observation. For errors, this means the LLM sees the error message and can decide whether to retry, try a different tool, or report the failure.
Step 2: Build a RetryTool wrapper
A retry wrapper transparently re-calls the underlying tool when it returns an error. Use exponential backoff to avoid hammering a rate-limited API.
import asyncio
import time
from typing import Any
class RetryTool(BaseTool):
"""Wraps any BaseTool with automatic retry on error."""
def __init__(
self,
inner: BaseTool,
max_retries: int = 3,
initial_delay: float = 1.0,
backoff_factor: float = 2.0,
) -> None:
self._inner = inner
self._max_retries = max_retries
self._initial_delay = initial_delay
self._backoff_factor = backoff_factor
@property
def name(self) -> str:
return self._inner.name
@property
def description(self) -> str:
return self._inner.description
@property
def parameters(self) -> dict:
return self._inner.parameters
def schema(self) -> dict:
return self._inner.schema()
async def run(self, **kwargs: Any) -> ToolResult:
delay = self._initial_delay
for attempt in range(self._max_retries + 1):
result = await self._inner.run(**kwargs)
if not result.is_error:
return result # Success — no retry needed
if attempt < self._max_retries:
# Only sleep between attempts; last failure falls through
print(f"[retry] {self.name} failed (attempt {attempt + 1}): {result.error}")
await asyncio.sleep(delay)
delay *= self._backoff_factor
return result # Return the last error after all retries exhausted
Step 3: Build a FallbackTool chain
A fallback chain tries each tool in order and returns the first success. This is ideal when you have a premium tool (e.g., Tavily) and a free fallback (e.g., DuckDuckGo).
class FallbackTool(BaseTool):
"""Try each tool in order; return the first successful result."""
def __init__(self, tools: list[BaseTool], name: str, description: str) -> None:
if not tools:
raise ValueError("FallbackTool requires at least one tool")
self._tools = tools
self._name = name
self._description = description
self._parameters = tools[0].parameters
@property
def name(self) -> str:
return self._name
@property
def description(self) -> str:
return self._description
@property
def parameters(self) -> dict:
return self._parameters
async def run(self, **kwargs: Any) -> ToolResult:
errors = []
for tool in self._tools:
result = await tool.run(**kwargs)
if not result.is_error:
return result # First success wins
errors.append(f"{tool.name}: {result.error}")
# All tools failed — return a combined error message
combined = "; ".join(errors)
return ToolResult(output="", error=f"All fallbacks failed: {combined}")
Step 4: Build a flaky tool for testing
Always test your error handling logic with a deliberately unreliable tool:
import random
class FlakeySearchTool(BaseTool):
"""Simulates a search tool that fails randomly — for testing retry logic."""
name = "search"
description = "Search the web for information."
parameters = {
"type": "object",
"properties": {"query": {"type": "string", "description": "Search query"}},
"required": ["query"],
}
def __init__(self, failure_rate: float = 0.6) -> None:
self._failure_rate = failure_rate
self._call_count = 0
async def run(self, query: str = "", **kwargs: Any) -> ToolResult:
self._call_count += 1
if random.random() < self._failure_rate:
return ToolResult(output="", error=f"Rate limit exceeded (call #{self._call_count})")
return ToolResult(output=f"Search results for '{query}': [mock result #{self._call_count}]")
Step 5: Add error logging without disrupting the agent
Wrap run() to log errors to a side channel while still returning the ToolResult to the agent:
class LoggingTool(BaseTool):
"""Wraps a tool to log all errors to a collector."""
def __init__(self, inner: BaseTool) -> None:
self._inner = inner
self.errors: list[dict] = []
@property
def name(self) -> str:
return self._inner.name
@property
def description(self) -> str:
return self._inner.description
@property
def parameters(self) -> dict:
return self._inner.parameters
async def run(self, **kwargs: Any) -> ToolResult:
result = await self._inner.run(**kwargs)
if result.is_error:
self.errors.append({"tool": self.name, "error": result.error, "kwargs": kwargs})
return result # Always forward the result — logging never blocks the agent
Step 6: Steer the agent to handle errors in system prompt
The agent's behavior when it receives a tool error depends on its system prompt. Be explicit:
from synapsekit.agents import FunctionCallingAgent
from synapsekit.llms.openai import OpenAILLM
agent = FunctionCallingAgent(
llm=OpenAILLM(model="gpt-4o-mini"),
tools=[RetryTool(FlakeySearchTool(failure_rate=0.5), max_retries=2)],
system_prompt=(
"You are a research assistant. "
"If a tool returns an error message, try calling it again once with the same or simplified input. "
"If it fails twice in a row, inform the user that the service is temporarily unavailable "
"and provide the best answer you can from your training knowledge."
),
max_iterations=8,
)
Complete working example
import asyncio
import random
from typing import Any
from synapsekit.agents import BaseTool, FunctionCallingAgent, ToolResult
from synapsekit.llms.openai import OpenAILLM
class FlakeySearchTool(BaseTool):
name = "web_search"
description = "Search the web. May occasionally fail due to rate limits."
parameters = {
"type": "object",
"properties": {"query": {"type": "string"}},
"required": ["query"],
}
def __init__(self, failure_rate: float = 0.5) -> None:
self._failure_rate = failure_rate
self._calls = 0
async def run(self, query: str = "", **kwargs: Any) -> ToolResult:
self._calls += 1
if random.random() < self._failure_rate:
return ToolResult(output="", error="503 Service Unavailable")
return ToolResult(output=f"Results for '{query}': [article 1, article 2, article 3]")
class RetryTool(BaseTool):
def __init__(self, inner: BaseTool, max_retries: int = 2, delay: float = 0.5) -> None:
self._inner = inner
self._max_retries = max_retries
self._delay = delay
@property
def name(self) -> str: return self._inner.name
@property
def description(self) -> str: return self._inner.description
@property
def parameters(self) -> dict: return self._inner.parameters
async def run(self, **kwargs: Any) -> ToolResult:
for attempt in range(self._max_retries + 1):
result = await self._inner.run(**kwargs)
if not result.is_error:
return result
if attempt < self._max_retries:
print(f" [retry {attempt + 1}/{self._max_retries}] {result.error}")
await asyncio.sleep(self._delay)
return result
class FallbackTool(BaseTool):
def __init__(self, tools: list[BaseTool], name: str, description: str) -> None:
self._tools = tools
self._name = name
self._description = description
@property
def name(self) -> str: return self._name
@property
def description(self) -> str: return self._description
@property
def parameters(self) -> dict: return self._tools[0].parameters
async def run(self, **kwargs: Any) -> ToolResult:
errors = []
for t in self._tools:
result = await t.run(**kwargs)
if not result.is_error:
return result
errors.append(f"{t.name}: {result.error}")
return ToolResult(output="", error="All options failed: " + "; ".join(errors))
async def main() -> None:
random.seed(42)
# Scenario 1: retry wrapper around a flaky tool
print("=== Scenario 1: Retry wrapper ===")
flakey = FlakeySearchTool(failure_rate=0.6)
reliable = RetryTool(flakey, max_retries=3, delay=0.1)
for q in ["quantum computing", "Python 3.13 features", "SynapseKit release"]:
result = await reliable.run(query=q)
status = "ERROR" if result.is_error else "OK"
print(f" [{status}] {q}: {str(result)[:70]}")
print(f"\n Total underlying calls: {flakey._calls}")
# Scenario 2: fallback chain
print("\n=== Scenario 2: Fallback chain ===")
always_fails = FlakeySearchTool(failure_rate=1.0) # always fails
always_fails.name = "premium_search"
backup = FlakeySearchTool(failure_rate=0.0) # always succeeds
backup.name = "fallback_search"
fallback = FallbackTool(
tools=[always_fails, backup],
name="search",
description="Search with automatic fallback.",
)
result = await fallback.run(query="machine learning news")
print(f" Result: {result.output}")
# Scenario 3: full agent with error handling
print("\n=== Scenario 3: Agent with error recovery ===")
agent = FunctionCallingAgent(
llm=OpenAILLM(model="gpt-4o-mini"),
tools=[RetryTool(FlakeySearchTool(failure_rate=0.4), max_retries=2)],
system_prompt=(
"You are a helpful assistant. If a tool returns an error, "
"retry once. If it still fails, answer from your knowledge."
),
max_iterations=6,
)
answer = await agent.run("What is the capital of Japan?")
print(f" Answer: {answer}")
asyncio.run(main())
Expected output
=== Scenario 1: Retry wrapper ===
[retry 1/3] 503 Service Unavailable
[OK] quantum computing: Results for 'quantum computing': [article 1, article 2, artic
[OK] Python 3.13 features: Results for 'Python 3.13 features': [article 1, article 2
[retry 1/3] 503 Service Unavailable
[retry 2/3] 503 Service Unavailable
[OK] SynapseKit release: Results for 'SynapseKit release': [article 1, article 2, ar
Total underlying calls: 8
=== Scenario 2: Fallback chain ===
Result: Results for 'machine learning news': [article 1, article 2, article 3]
=== Scenario 3: Agent with error recovery ===
[retry 1/2] 503 Service Unavailable
Answer: The capital of Japan is Tokyo.
How it works
ToolResult.is_error checks self.error is not None. The agent receives str(result) as its observation string — for errors, this is the error message. The LLM reads the error message and applies the instructions in system_prompt to decide what to do next: retry, try a different tool, or answer from knowledge.
The RetryTool wrapper is transparent to the agent because it preserves name, description, and parameters from the inner tool. The agent cannot tell it is calling a wrapped version — it sees the same schema and tool name.
The FallbackTool hides multiple implementations behind a single tool name. The agent makes one call; the fallback logic is entirely in Python. This avoids cluttering the agent's context with error/retry reasoning when a Python-level fallback is sufficient.
Variations
Classify errors before retrying to avoid retrying non-recoverable failures:
async def run(self, **kwargs: Any) -> ToolResult:
result = await self._inner.run(**kwargs)
if result.is_error:
# Do not retry validation errors — only transient errors
if "required" in result.error or "invalid" in result.error.lower():
return result
# Retry network/timeout errors
await asyncio.sleep(self._delay)
return await self._inner.run(**kwargs)
return result
Track error metrics for monitoring:
from collections import Counter
class MetricsTool(BaseTool):
def __init__(self, inner: BaseTool) -> None:
self._inner = inner
self.call_count = 0
self.error_count = 0
self.error_types: Counter = Counter()
async def run(self, **kwargs: Any) -> ToolResult:
self.call_count += 1
result = await self._inner.run(**kwargs)
if result.is_error:
self.error_count += 1
self.error_types[result.error[:40]] += 1
return result
Troubleshooting
Agent retries indefinitely — max_iterations is the agent-level cap, not the tool-level cap. A retry wrapper with max_retries=3 that always fails will trigger the agent to retry the tool call from the LLM side too. Set a low max_retries in the wrapper and rely on the system prompt to tell the agent when to give up.
RetryTool.name is None — BaseTool.name is a class attribute, not an instance attribute. When forwarding with a @property, ensure the property is accessible from both the class and instance.
Fallback order is wrong — list tools in order of preference: fastest/cheapest first, most reliable last. The fallback stops at the first success, so the most reliable tool should be last if you want it used only as a last resort.
Backoff delay blocks the event loop — use asyncio.sleep(), not time.sleep(). A synchronous sleep in an async tool freezes the entire event loop.
Next steps
- Multi-Tool Orchestration — apply retry and fallback wrappers to a diverse toolset
- Agent with Safety Guardrails — combine error handling with input/output validation
- Creating Custom Tools — build error-resilient tools from scratch with explicit validation