ScarfBench: Benchmarking AI Agents for Enterprise Java Framework Migration
Modernizing enterprise applications is one of the largest and most expensive software engineering activities organizations undertake. Teams migrate applications across frameworks to improve maintainability, cloud readiness, developer productivity, and access to modern capabilities.
Recent advances in coding agents have sparked excitement around AI-assisted modernization. But an important question remains:
Can AI agents reliably modernize real-world enterprise applications?
Existing software engineering benchmarks have demonstrated impressive progress in bug fixing and code generation, but framework migration presents a fundamentally different challenge. Success requires not only translating code, but also preserving behavior, adapting build systems, and navigating runtime dependencies.
To address this gap, we introduce ScarfBench (Self-Contained Application Refactoring Benchmark), an open benchmark for evaluating AI agents on cross-framework migration tasks in Enterprise Java.
ScarfBench focuses on migrations across three major Java ecosystems: Spring, Jakarta EE, and Quarkus.
Unlike traditional benchmarks that compare generated code against reference implementations, ScarfBench evaluates whether migrated applications actually build, deploy, and preserve behavior.
Framework migration is much more than replacing annotations.
A simple repository migration can require changes across dependency injection, persistence configuration, queries, and framework descriptors. Small mistakes in any of these pieces can prevent successful deployment.
Figure: Spring → Jakarta Migration Example
Framework migration requires translating framework semantics, not just source code.
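To make the point concrete, here is a minimal sketch of why a purely textual annotation swap falls short. Everything in it is an assumption for illustration: the `NaiveAnnotationRewrite` class, the `OrderService` snippet, and the two-entry mapping are hypothetical, and `@ApplicationScoped` is only one plausible CDI counterpart for a Spring `@Service`. It is not how ScarfBench or any agent performs a migration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of ScarfBench.
final class NaiveAnnotationRewrite {

    // Assumed Spring -> Jakarta annotation counterparts (deliberately incomplete).
    static final Map<String, String> SPRING_TO_JAKARTA = new LinkedHashMap<>();
    static {
        SPRING_TO_JAKARTA.put("@Autowired", "@Inject");
        SPRING_TO_JAKARTA.put("@Service", "@ApplicationScoped");
    }

    // Replaces each known Spring annotation with its assumed counterpart.
    static String rewrite(String source) {
        String result = source;
        for (Map.Entry<String, String> entry : SPRING_TO_JAKARTA.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    // Crude check for Spring dependencies that the rewrite left behind.
    static boolean stillReferencesSpring(String source) {
        return source.contains("org.springframework");
    }

    public static void main(String[] args) {
        String spring =
              "import org.springframework.data.jpa.repository.JpaRepository;\n"
            + "@Service\n"
            + "class OrderService {\n"
            + "    @Autowired OrderRepository orders;\n"
            + "}\n"
            + "interface OrderRepository extends JpaRepository<Order, Long> {}\n";

        String rewritten = rewrite(spring);
        System.out.println(rewritten);
        System.out.println("Still references Spring: " + stillReferencesSpring(rewritten));
    }
}
```

The annotations change, but the Spring Data repository interface and its import survive untouched. Replacing them means rewriting the persistence layer, its queries, and its configuration, which is exactly the kind of work a string substitution cannot do.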
ScarfBench provides a systematic way to evaluate AI agents on enterprise Java framework migration tasks.
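Checking whether a migrated application builds, deploys, and preserves behavior provides a much more realistic measure of modernization quality than comparing generated code against a reference implementation.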
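The build, deploy, and behavior criteria can be pictured as a staged gate in which a migration passes only if every stage succeeds in order. The sketch below is an assumption about how such a gate could be modeled, not ScarfBench's actual harness: `MigrationCheck`, `Stage`, and `Step` are hypothetical names, and the lambdas stand in for real checks such as running a build or exercising a deployed application.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.BooleanSupplier;

// Hypothetical model of a staged evaluation; not ScarfBench's real harness.
final class MigrationCheck {

    // The three criteria named in the text, in the order they are checked.
    enum Stage { BUILD, DEPLOY, BEHAVIOR }

    // A stage paired with a check that reports whether it passed.
    record Step(Stage stage, BooleanSupplier check) {}

    // Runs the steps in order and stops at the first failure.
    // Returns the failing stage, or an empty Optional if every stage passed.
    static Optional<Stage> firstFailure(List<Step> steps) {
        for (Step step : steps) {
            if (!step.check().getAsBoolean()) {
                return Optional.of(step.stage());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Placeholder outcomes: the build succeeds but deployment fails,
        // so the behavior check is never reached.
        List<Step> steps = List.of(
            new Step(Stage.BUILD, () -> true),
            new Step(Stage.DEPLOY, () -> false),
            new Step(Stage.BEHAVIOR, () -> true));

        System.out.println(firstFailure(steps).map(Stage::name).orElse("PASSED"));
    }
}
```

Stopping at the first failing stage reflects the dependency between the criteria: behavior cannot be checked on an application that does not deploy, and an application that does not build cannot be deployed.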
ScarfBench includes both focused migration tasks and whole-application migrations.
Starting from a JSR-based enterprise Java taxonomy, expert migrations create verified implementations across Spring, Jakarta EE, and Quarkus.
We evaluated several state-of-the-art coding agents on ScarfBench.
Source: Hugging Face