Partial Failure Handling is a set of strategies used in agentic systems to manage situations where one or more subtasks in a workflow fail while others succeed. Instead of treating the entire process as a single atomic unit that either succeeds or fails completely, this approach aims to salvage value from the completed parts of the workflow.

This is particularly critical in parallel execution patterns, where multiple branches run concurrently. If one branch fails (e.g., a tool times out or an agent returns an error), a naive implementation halts the entire operation and discards the successful results from the other branches. Robust partial failure handling instead records the outcome of each branch separately and decides, per failure, how the workflow should respond.
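
The difference between the naive and the tolerant behaviour can be sketched with Python's asyncio. This is only an illustration, assuming three hypothetical branches (fetch_ok and fetch_fail are made-up stand-ins for real tool or agent calls); it is not tied to any particular agent framework.

```python
import asyncio


async def fetch_ok(label):
    # Hypothetical branch that succeeds.
    return f"{label}: done"


async def fetch_fail(label):
    # Hypothetical branch that fails, e.g. a tool timing out.
    raise TimeoutError(f"{label}: tool timed out")


async def naive():
    # Default gather: the first exception propagates to the caller,
    # so the results of branches "a" and "c" are never returned.
    return await asyncio.gather(fetch_ok("a"), fetch_fail("b"), fetch_ok("c"))


async def tolerant():
    # return_exceptions=True returns exceptions as values, in call order,
    # so successful results are kept next to the recorded failures.
    results = await asyncio.gather(
        fetch_ok("a"), fetch_fail("b"), fetch_ok("c"),
        return_exceptions=True,
    )
    succeeded = [r for r in results if not isinstance(r, BaseException)]
    failed = [r for r in results if isinstance(r, BaseException)]
    return succeeded, failed
```

Calling asyncio.run(tolerant()) yields the two successful results together with the captured exception, whereas asyncio.run(naive()) raises and the caller sees nothing else.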

Key strategies include:

  • Retry Failed Branches: Attempting to re-run only the specific subtask that failed, especially for transient errors like network timeouts.
  • Proceed with Partial Results: If the failed task is not critical, the workflow can continue using the results from the successful branches, while explicitly noting the missing information.
  • Fail the Entire Pipeline: In cases where the failed task is essential for all subsequent steps, the only safe option is to terminate the entire workflow.
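  • Compensation: For stateful operations, this involves running compensating actions that “undo” the already-completed steps, returning the system to a consistent state after a downstream failure (for example, cancelling a reservation that was made before a later payment step failed).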
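
The first three strategies above can be combined in one small policy. The following is a minimal sketch, not a reference implementation: BranchOutcome, run_branch, run_parallel, PipelineFailed, and the choice of which exceptions count as transient are all assumptions made for illustration.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class BranchOutcome:
    name: str
    ok: bool
    value: Any = None
    error: Optional[str] = None
    attempts: int = 0


class PipelineFailed(Exception):
    """Raised when a branch marked as critical could not be completed."""


async def run_branch(name, task_factory, retries=2, timeout=5.0, backoff=0.05):
    # Retry only this branch, and only for errors assumed to be transient.
    last_error = None
    for attempt in range(1, retries + 2):
        try:
            value = await asyncio.wait_for(task_factory(), timeout)
            return BranchOutcome(name, True, value=value, attempts=attempt)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = repr(exc)
            if attempt <= retries:
                await asyncio.sleep(backoff * attempt)  # simple linear backoff
        except Exception as exc:
            # Non-transient error: retrying is unlikely to help.
            return BranchOutcome(name, False, error=repr(exc), attempts=attempt)
    return BranchOutcome(name, False, error=last_error, attempts=retries + 1)


async def run_parallel(branches, critical=frozenset()):
    # branches maps a branch name to a zero-argument coroutine function.
    outcomes = await asyncio.gather(
        *(run_branch(name, factory) for name, factory in branches.items())
    )
    failed = [o for o in outcomes if not o.ok]
    failed_critical = [o.name for o in failed if o.name in critical]
    if failed_critical:
        # Fail the entire pipeline: later steps cannot run without these.
        raise PipelineFailed(f"critical branches failed: {failed_critical}")
    # Proceed with partial results, recording what is missing and why.
    return {
        "results": {o.name: o.value for o in outcomes if o.ok},
        "missing": {o.name: o.error for o in failed},
    }
```

Which branches are critical is a per-workflow decision, so the sketch takes it as a parameter rather than guessing. Compensation is left out here because it depends on the undo actions that each stateful step provides.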

A core principle is to never silently drop failed results. The system must always be aware of what is missing, so it can make an informed decision and communicate the incomplete nature of its output to the user or downstream agents.
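
One way to make this principle concrete is to carry the missing pieces alongside the results in the same data structure, so that no consumer can read the output without also seeing the gaps. The sketch below assumes a hypothetical PartialResult container and summarize helper; the field names and the wording of the notice are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class PartialResult:
    results: dict = field(default_factory=dict)  # branch name -> value
    missing: dict = field(default_factory=dict)  # branch name -> failure reason

    @property
    def complete(self) -> bool:
        return not self.missing


def summarize(outcome: PartialResult) -> str:
    # Render the successful results, then state explicitly what is absent.
    lines = [f"{name}: {value}" for name, value in outcome.results.items()]
    if not outcome.complete:
        lines.append("INCOMPLETE: the following inputs are missing")
        lines += [
            f"  - {name} ({reason})" for name, reason in outcome.missing.items()
        ]
    return "\n".join(lines)
```

A downstream agent can check the complete flag before relying on the output, and a user-facing summary always lists the failed branches together with the reason they failed.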

Connections

Sources