Cisco AI Introduces FAPO: Pipeline-Aware Prompt Optimization With Step-Level Failure Attribution and Claude Code Orchestration
Getting prompts right is still the hardest part of shipping reliable LLM applications. Small wording changes can swing accuracy by 20 percent. What works on a few examples often breaks at scale. When a multi-step pipeline returns a wrong answer, finding the failing step means inspecting intermediate outputs by hand.
Cisco AI introduced FAPO to address that bottleneck. FAPO stands for Fully Automated Prompt Optimization. It is a Claude Code-driven system that optimizes LLM pipelines from baseline prompts to target accuracy. You supply a dataset and an initial prompt. FAPO then evaluates, classifies failures, proposes variants, validates them, and iterates. The whole loop is orchestrated by Claude Code agents. The project ships as open source under Apache 2.0 and also supports Codex as the optimization agent.
In Cisco’s reported evaluation, FAPO beat GEPA, a state-of-the-art prompt optimizer, on 15 of 18 model-benchmark comparisons. On the two benchmarks where FAPO escalated to pipeline changes, the mean gain over GEPA reached +33.8 percentage points.
FAPO is a multi-tenant evaluation and optimization framework. A tenant is a self-contained optimization project. Each tenant directory holds one task’s prompts, dataset, chain definition, scorer, and config. Tenants stay isolated, so unrelated tasks optimize side by side without interference.
The core engine is named hephaestus and is domain-agnostic. It handles evaluation, chain execution, and scoring. Chains are LangGraph state graphs that process each test case. Out of the box, FAPO supports three providers: OpenAI, Baseten, and SageMaker.
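To make the idea of a chain that processes each test case more concrete, here is a minimal sketch in plain Python. It is a stand-in for a state graph, not LangGraph and not hephaestus code: the names `run_chain`, `exact_match`, `normalize`, and `classify` are hypothetical, and the toy nodes use string rules instead of LLM calls. The point it illustrates is that keeping a per-step trace lets a wrong answer be traced to the step that produced it.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]
Node = Callable[[State], State]


def run_chain(nodes: List[Tuple[str, Node]], state: State) -> Tuple[State, List[Tuple[str, State]]]:
    """Run nodes in order, merging each node's output into the shared state.

    Returns the final state plus a trace of (node name, state snapshot),
    so intermediate outputs can be inspected per step.
    """
    trace: List[Tuple[str, State]] = []
    for name, node in nodes:
        state = {**state, **node(state)}
        trace.append((name, dict(state)))
    return state, trace


def exact_match(predicted: object, expected: object) -> float:
    """Hypothetical scorer: 1.0 for an exact match, 0.0 otherwise."""
    return 1.0 if predicted == expected else 0.0


# Toy nodes standing in for LLM calls (assumption: real nodes would call a model).
def normalize(state: State) -> State:
    return {"text": str(state["input"]).strip().lower()}


def classify(state: State) -> State:
    return {"answer": "positive" if "good" in str(state["text"]) else "negative"}


CHAIN = [("normalize", normalize), ("classify", classify)]
```

Calling `run_chain(CHAIN, {"input": "..."})` on one test case returns the final state and the trace; passing the final `answer` and the expected output to `exact_match` yields that case's score.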
The one input you must bring is a dataset: paired inputs and expected outputs that define success. FAPO splits it into a validation set and a held-out test set. The validation set drives iteration; the test set is used only for a final one-shot evaluation. From a task description, Claude can scaffold the rest: the initial prompt, the chain, and the scorer.
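The validation and held-out split described above can be sketched as follows. This is an illustration under stated assumptions, not FAPO's actual splitting logic: the function name `split_dataset`, the 20 percent test fraction, and the seeded shuffle are all hypothetical choices.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    examples: Sequence[T], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Return (validation, test) as two disjoint lists.

    The shuffle is seeded so the held-out test set stays the same
    across optimization runs (an assumption made for this sketch).
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    if len(examples) < 2:
        raise ValueError("need at least two examples to split")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one example on each side of the split.
    n_test = min(len(shuffled) - 1, max(1, round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]
```

Keeping the test set out of every iteration is what makes the final one-shot evaluation a fair estimate: no prompt variant was ever selected because it happened to do well on those examples.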
Once the pieces exist, FAPO runs a closed loop until target accuracy is reached. Each cycle runs six stages:
The system works at three escalating levels. Prompt edits are lowest cost and tried first. Parameter changes adjust config values like retrieval_k or temperature. Structural changes alter chain topology, such as adding a self-reflection node or switching to a ReAct pattern. FAPO exhausts one level before escalating to the next.
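The escalation policy can be sketched as a small loop. This is a simplified illustration, not FAPO's implementation: `optimize`, `propose`, `evaluate`, and `max_rounds_per_level` are hypothetical names, and the rule for when a level counts as exhausted (no variant improves the validation score, or the round budget runs out) is an assumption made for this sketch.

```python
from typing import Callable, Dict, List, Tuple

Candidate = Dict[str, object]

# Cheapest change first, most invasive last.
LEVELS = ["prompt", "parameter", "structural"]


def optimize(
    baseline: Candidate,
    propose: Callable[[str, Candidate], List[Candidate]],
    evaluate: Callable[[Candidate], float],
    target: float,
    max_rounds_per_level: int = 3,
) -> Tuple[Candidate, float]:
    """Try each level in order and escalate only once a level stops helping."""
    best, best_score = baseline, evaluate(baseline)
    for level in LEVELS:
        for _ in range(max_rounds_per_level):
            if best_score >= target:
                return best, best_score  # target reached, stop early
            variants = propose(level, best)
            if not variants:
                break  # nothing left to try at this level
            scored = [(evaluate(v), v) for v in variants]
            top_score, top = max(scored, key=lambda pair: pair[0])
            if top_score <= best_score:
                break  # no improvement: treat this level as exhausted
            best, best_score = top, top_score
    return best, best_score
```

In this sketch `evaluate` would score a candidate on the validation set only, and `propose` would return prompt rewrites, config changes such as a different retrieval_k, or topology changes, depending on the level it is asked for.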
Source: MarkTechPost