We know how to build smarter robots. Now, we need to learn smarter ways to test them
AI News Desk · The Robot Report
Right now, today, you can spend $14,000 and buy a humanoid robot.
No safety certification has been reviewed, and no standardized test protocol has been verified. You get a machine capable of physical force and real-time autonomous decision-making. And the frameworks for validating its behavior are still catching up to what it can do.
That’s not a criticism of the engineers building these systems. The intelligence side of robotics is advancing at a pace that genuinely deserves the excitement it gets: better perception, more robust locomotion, faster inference, and tighter control loops.
But here’s the question I keep coming back to: As the control architecture of these systems evolves from simple teleoperation all the way to fully autonomous reinforcement learning, are our testing methodologies and safety validation processes evolving with them?
I don’t think they are. Not yet. And I think that gap is worth talking about, not to slow the industry down, but to help it scale responsibly.
Two research papers I’ve worked on recently have shaped how I think about this. One proposes a framework for classifying robot intelligence by its underlying control architecture. The other examines how software safety risk analysis needs to evolve for AI-driven systems.
Together, they point toward something the industry increasingly needs: a testing philosophy that scales alongside autonomy. One where formal safety guarantees replace test-case enumeration at the highest levels, and where adversarial robustness evaluation becomes as routine as functional testing.
Before we can talk about how to test autonomous systems, it helps to be precise about what kind of system we’re actually testing.
In a paper published in IJRCAR in March 2026, I proposed a five-level taxonomy that classifies robots by their cognitive and control architecture. Unlike the SAE driving levels, which are organized around how attentive a human operator is, this taxonomy is organized around how the machine itself is processing information and generating behavior.
Levels 0 and 1: Teleoperation and imitation. At Level 0, a human is doing all the thinking. The robot executes intent directly via teleoperation. At Level 1, it has learned to imitate from recorded demonstrations through behavior cloning and can operate without a live operator, but only within the bounds of what it’s seen. The brittleness here is well-documented: Robots trained on clean, structured demonstrations struggle when real-world conditions drift even slightly from training data. A different floor texture, an object placed at an unfamiliar angle. Testing at these levels is relatively tractable, and the tooling is mature.
Level 2: Supervised real-time learning. The robot can detect its own uncertainty, pause safely, request correction, and integrate that correction into its future behavior using inverse reinforcement learning. Testing becomes a two-part challenge: validating the uncertainty detection mechanism itself, and validating the integrity of the learning update triggered by each corrective intervention.
Level 3: Self-supervised learning. The robot generates its own training signals through trial and error, annotating its own successes and failures without human input. Here, the test engineer’s job fundamentally changes. You’re no longer just testing fixed behavior. You’re validating a system that is continuously rewriting its own policy. Testing needs to assess not just current performance, but also the safety of the learning process itself.
Level 4: Reinforcement learning. Full autonomy. The robot frames every task as an optimization problem and solves it through continuous interaction with its environment, often discovering solutions a human couldn’t demonstrate. At this level, traditional test case enumeration breaks down. The behavior space is too large, too dynamic, and too emergent to enumerate exhaustively.
Each level up this ladder doesn’t just add capability. It also adds a fundamentally different type of failure mode and demands a fundamentally different approach to validation.
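As a rough illustration, the five levels and the validation concern each one adds can be written down as a small lookup table. This is only a sketch: the names `ControlLevel`, `VALIDATION_FOCUS`, and `validation_focus` are hypothetical, and the one-line summaries paraphrase the descriptions above rather than quote the paper.

```python
from enum import IntEnum


class ControlLevel(IntEnum):
    """The five control levels described above (identifiers are illustrative)."""
    TELEOPERATION = 0
    IMITATION = 1
    SUPERVISED_REALTIME_LEARNING = 2
    SELF_SUPERVISED_LEARNING = 3
    REINFORCEMENT_LEARNING = 4


# Paraphrased summary of what validation has to cover at each level.
VALIDATION_FOCUS = {
    ControlLevel.TELEOPERATION: "operator intent is executed faithfully",
    ControlLevel.IMITATION: "behavior when conditions drift from the demonstrations",
    ControlLevel.SUPERVISED_REALTIME_LEARNING: "uncertainty detection and each learning update",
    ControlLevel.SELF_SUPERVISED_LEARNING: "safety of the learning process itself",
    ControlLevel.REINFORCEMENT_LEARNING: "emergent behavior that cannot be enumerated",
}


def validation_focus(level: int) -> str:
    """Look up the validation concern for a level given as an integer 0..4."""
    return VALIDATION_FOCUS[ControlLevel(level)]


for level in ControlLevel:
    print(f"Level {level.value}: {validation_focus(level)}")
```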
The go-to risk analysis tool in automotive and robotics software development is FMEA (failure mode and effects analysis). In a co-authored paper published in IRE Journals (2025), we examined the specific limitations of software design FMEA when applied to AI-driven systems, and what a more robust approach looks like.
The core issue is the risk priority number, or RPN, which is FMEA’s standard scoring mechanism. It multiplies Severity, Occurrence, and Detection into a single score. The problem becomes obvious the moment you put numbers to it: a catastrophic failure rated Severity 10, Occurrence 1, Detection 1 scores 10. So does a minor but hard-to-detect failure rated Severity 1, Occurrence 1, Detection 10. Same number. Completely different threat.
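The arithmetic is easy to reproduce. The sketch below assumes the common 1-to-10 rating scale for each factor; the function name `rpn` and the range check are illustrative choices, not something taken from the paper.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk priority number: the product of the three FMEA ratings.

    Assumes each rating is on a 1-10 scale (an assumption of this sketch).
    """
    for name, value in (("severity", severity),
                        ("occurrence", occurrence),
                        ("detection", detection)):
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
    return severity * occurrence * detection


catastrophic = rpn(severity=10, occurrence=1, detection=1)
low_severity = rpn(severity=1, occurrence=1, detection=10)
print(catastrophic, low_severity)  # prints: 10 10
```

Both calls return the same score, which is exactly the information loss described above: the product cannot tell a reader which factor drove it.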
In a traditional deterministic software system, experienced engineers work around this with judgment. In a neural network-driven system where failure modes are emergent and context-dependent, that judgment is much harder to apply reliably.
The consequences of getting it wrong aren’t just a failed test. They’re deployment delays, liability exposure, and in the worst cases, incidents that set back public trust in an entire product category.
The paper proposes integrating a risk priority matrix alongside HAZOP (hazard and operability study) analysis, methods that evaluate risk through richer contextual lenses rather than collapsing everything into a single number. Grounded in ISO 26262 for functional safety and ISO 21434 for automotive cybersecurity, this combined approach gives engineers a more nuanced vocabulary for reasoning about AI-specific failure modes.
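To make the contrast with RPN concrete, here is a minimal sketch of a matrix-style lookup. Everything in it is an assumption for illustration: the band thresholds, the category labels, the `MATRIX` contents, and the choice to combine Occurrence and Detection by taking the worse of the two. It is not the matrix from the paper. The point is only that severity is kept as its own axis instead of being multiplied away.

```python
def band(score: int) -> str:
    """Map a 1-10 rating to a coarse band (thresholds are hypothetical)."""
    if not 1 <= score <= 10:
        raise ValueError(f"rating must be between 1 and 10, got {score}")
    if score <= 3:
        return "low"
    if score <= 7:
        return "medium"
    return "high"


# Rows are severity bands, columns are likelihood bands. Values are illustrative.
MATRIX = {
    ("high", "low"): "high",
    ("high", "medium"): "critical",
    ("high", "high"): "critical",
    ("medium", "low"): "medium",
    ("medium", "medium"): "medium",
    ("medium", "high"): "high",
    ("low", "low"): "low",
    ("low", "medium"): "low",
    ("low", "high"): "medium",
}


def risk_priority(severity: int, occurrence: int, detection: int) -> str:
    """Look up a priority category without collapsing severity into a product."""
    # Assumption: treat the worse of occurrence and detection as "likelihood".
    likelihood = band(max(occurrence, detection))
    return MATRIX[(band(severity), likelihood)]


print(risk_priority(10, 1, 1))   # the catastrophic case
print(risk_priority(1, 1, 10))   # the low-severity, hard-to-detect case
```

With these hypothetical bands, the two failures that tied on RPN land in different categories, because a high-severity row can never rank below "high".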
The regulatory backdrop reinforces why this matters. ISO 25785-1, the first international safety standard for bipedal robots, was published in May 2025 and covers industrial workplace deployment only. ISO 13482, addressing personal-care robots, was updated in 2025 but predates modern foundation models.
The 2025 revision of ISO 10218-1 for industrial robotics made meaningful progress, but safety researchers are already identifying gaps in AI-driven humanoids and mobile manipulation that the update doesn’t fully close. These standards are essential foundations. They need practitioner input to evolve faster.
So what does a more appropriate testing approach look like across these control levels? Here’s how I think about it.
For Levels 0 and 1, conventional verification and validation methods apply reasonably well. Hardware-in-the-loop (HiL) testing, structured test suites, and systematic boundary testing of the training data distribution are achievable and effective. The key addition for Level 1 is deliberate out-of-distribution (OOD) testing, probing the edges of the training corpus intentionally rather than assuming coverage.
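One simple way to start on OOD testing is to flag test conditions that sit far outside the statistics of the demonstration data. The sketch below uses a per-feature z-score, which is a deliberately crude stand-in for a real OOD detector. The feature choice (floor friction, object angle), the sample values, and the threshold of 3 are all assumptions made for illustration.

```python
from statistics import mean, stdev


def fit_feature_stats(training_rows):
    """Return (mean, standard deviation) for each feature column."""
    columns = list(zip(*training_rows))
    return [(mean(column), stdev(column)) for column in columns]


def is_out_of_distribution(sample, stats, z_threshold=3.0):
    """True if any feature lies more than z_threshold deviations from its mean."""
    for value, (mu, sigma) in zip(sample, stats):
        if sigma == 0:
            # A constant feature in training: any other value is unseen.
            if value != mu:
                return True
            continue
        if abs(value - mu) / sigma > z_threshold:
            return True
    return False


# Hypothetical demonstrations: (floor friction coefficient, object angle in degrees)
demonstrations = [(0.60, 0.0), (0.62, 5.0), (0.58, -5.0), (0.61, 2.0), (0.59, -2.0)]
stats = fit_feature_stats(demonstrations)

print(is_out_of_distribution((0.60, 1.0), stats))   # close to the demonstrations
print(is_out_of_distribution((0.30, 1.0), stats))   # a much slicker floor
print(is_out_of_distribution((0.60, 45.0), stats))  # an unfamiliar object angle
```

In a test suite, the flagged conditions are the ones worth scheduling on purpose, since they probe the edge of the training corpus rather than its center.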
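For Level 2, the test strategy needs to expand to cover the learning loop itself. Two things need validation separately: the uncertainty detection mechanism itself, and the integrity of the learning update triggered by each corrective intervention.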
Logging and replay infrastructure becomes critical. Every human intervention should be recorded, tagged, and reviewed as a potential signal about where the policy is weak.
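A minimal sketch of what such a record might look like follows. The `Intervention` fields, the `InterventionLog` class, and the tag names are hypothetical; a real system would also persist sensor data for replay. The idea shown is just that tagged interventions can be counted to surface where the policy keeps asking for help.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Intervention:
    """One human correction, with enough context to review it later."""
    timestamp_s: float
    task: str
    uncertainty: float       # the policy's own uncertainty estimate at pause time
    operator_action: str
    tags: tuple = ()


class InterventionLog:
    def __init__(self):
        self._records = []

    def record(self, intervention: Intervention) -> None:
        self._records.append(intervention)

    def to_jsonl(self) -> str:
        """Serialize one JSON object per line, suitable for later replay tooling."""
        return "\n".join(json.dumps(asdict(r)) for r in self._records)

    def weak_spots(self, min_count: int = 2) -> list:
        """Tags that recur across interventions, most frequent first."""
        counts = Counter(tag for r in self._records for tag in r.tags)
        return [tag for tag, n in counts.most_common() if n >= min_count]


log = InterventionLog()
log.record(Intervention(12.5, "pick_cup", 0.81, "regrasp", ("low_light", "glossy_object")))
log.record(Intervention(48.0, "pick_cup", 0.77, "regrasp", ("low_light",)))
log.record(Intervention(95.2, "open_door", 0.64, "reposition", ("narrow_gap",)))
print(log.weak_spots())
```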
For Level 3, formal methods start becoming genuinely necessary rather than optional. When a system is rewriting its own policy through self-supervised learning, the safety constraints on that learning process need to be mathematically specified and verified, not just empirically tested.
In practice, the hardest part of Level 3 validation isn’t the tooling; it’s getting alignment on what “safe exploration” actually means for your specific platform before testing begins. Approaches like constrained reinforcement learning and safe exploration algorithms are worth building into the architecture from the start, not retrofitting later. Sim-to-real validation cycles need to explicitly stress-test self-supervised behaviors in edge case environments before any real-world deployment.
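One way to make "safe exploration" concrete enough to test is a runtime action filter, sometimes called a shield, that sits between the learning policy and the actuators. The sketch below is not constrained reinforcement learning itself; it is a simpler, related technique. The limit values and the names `SafetyLimits` and `shield` are hypothetical, and the force prediction is assumed to come from some upstream model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyLimits:
    max_joint_speed: float  # rad/s, hypothetical platform limit
    max_force: float        # N, hypothetical contact force limit


def shield(proposed_speed: float, predicted_force: float, limits: SafetyLimits):
    """Filter an exploratory action; return (safe_speed, intervened).

    If the predicted contact force breaks the limit, command a stop.
    Otherwise clip the speed into the allowed range.
    """
    if predicted_force > limits.max_force:
        return 0.0, True
    clipped = max(-limits.max_joint_speed, min(limits.max_joint_speed, proposed_speed))
    return clipped, clipped != proposed_speed


limits = SafetyLimits(max_joint_speed=2.0, max_force=150.0)
print(shield(0.5, 10.0, limits))    # within limits, passed through
print(shield(3.0, 10.0, limits))    # too fast, clipped
print(shield(0.5, 200.0, limits))   # predicted force too high, stopped
```

Writing the limits down as data like this forces the team to agree on them before testing begins, and the `intervened` flag gives the test harness something to count.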
For Level 4, the testing philosophy has to shift from test-case enumeration to statistical coverage and formal safety guarantees. Monte Carlo simulation at scale, adversarial environment generation, and domain randomization (the same techniques used in training) should also be core tools in validation. Behavioral specification frameworks that define what the policy must never do, regardless of what it discovers, are as important as performance benchmarks.
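The shift from enumeration to statistical coverage can be sketched with a toy example. Here a "never do" rule (never travel past the available clearance while stopping) is checked across randomized speeds and floor friction, and the observed violation rate is paired with a one-sided Hoeffding upper bound. The physics model, the parameter ranges, and all function names are assumptions chosen to keep the sketch self-contained; a real campaign would run the actual policy in a simulator.

```python
import math
import random

G = 9.81  # gravitational acceleration, m/s^2


def stopping_distance(speed: float, friction: float) -> float:
    """Idealized braking distance for a given speed and friction coefficient."""
    return speed ** 2 / (2 * friction * G)


def estimate_violation_rate(trials: int, clearance_m: float,
                            seed: int = 0, delta: float = 0.05):
    """Monte Carlo estimate of how often the 'never overrun' rule is broken.

    Returns (observed_rate, upper_bound), where the true rate is at most
    upper_bound with probability at least 1 - delta (Hoeffding's inequality).
    """
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        # Domain randomization: sample the conditions for this trial.
        speed = rng.uniform(0.5, 1.5)      # m/s, hypothetical range
        friction = rng.uniform(0.3, 0.9)   # hypothetical range
        if stopping_distance(speed, friction) > clearance_m:
            violations += 1
    rate = violations / trials
    margin = math.sqrt(math.log(1 / delta) / (2 * trials))
    return rate, min(1.0, rate + margin)


print(estimate_violation_rate(10_000, clearance_m=0.5))
print(estimate_violation_rate(10_000, clearance_m=0.2))
```

Note that zero observed violations still yields a nonzero upper bound. That residual is why statistical testing alone cannot stand in for formal guarantees at this level.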
One area that deserves particular attention as the industry scales toward Level 4 is federated reinforcement learning, the paradigm where robot fleets share policy updates across a network, distributing compute and accelerating learning convergence.
The efficiency gains are real and significant. But the testing and validation requirements are qualitatively different from single-robot systems.
When policy updates flow peer-to-peer across a fleet, the integrity of those updates needs to be verified at the point of aggregation. Research on federated learning security has documented specific failure modes: data poisoning, where a compromised node submits manipulated updates; backdoor attacks, where a trigger embedded during training causes targeted misbehavior at inference time; and model inversion, where gradient sharing inadvertently leaks information about local training environments. These aren’t theoretical. They are empirically demonstrated.
Testing a federated system, therefore, needs to include adversarial robustness evaluation of the aggregation mechanism, not just the individual policy. Byzantine-fault-tolerant aggregation algorithms like Krum and coordinate-wise trimmed mean, anomaly detection on incoming gradient updates, and cryptographic verification of update provenance are all engineering choices that should be in scope during design and testable during validation. Differential privacy techniques applied at the point of gradient sharing offer another layer of protection, limiting what a compromised update can reveal or corrupt. These aren’t exotic research tools. They’re available, documented, and increasingly necessary to treat as standard practice in any federated deployment.
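For a sense of what testing the aggregation step can look like, here is a small sketch of Krum in plain Python. Krum scores each update by its summed squared distance to its closest neighbors and keeps the single update with the lowest score, which requires at least 2f + 3 participants to tolerate f faulty ones. The update vectors are made up for illustration, and a production system would operate on full model tensors.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def krum(updates, num_byzantine):
    """Select one update using the Krum rule.

    Each update is scored by the sum of squared distances to its
    n - f - 2 nearest neighbors; the lowest score wins.
    """
    n = len(updates)
    if n < 2 * num_byzantine + 3:
        raise ValueError("Krum needs at least 2f + 3 updates to tolerate f faults")
    neighbors = n - num_byzantine - 2
    scores = []
    for i, update in enumerate(updates):
        distances = sorted(
            squared_distance(update, other)
            for j, other in enumerate(updates) if j != i
        )
        scores.append(sum(distances[:neighbors]))
    best = min(range(n), key=scores.__getitem__)
    return updates[best]


# Hypothetical fleet: four honest robots and one poisoned update.
honest = [[0.10, -0.20], [0.12, -0.18], [0.09, -0.21], [0.11, -0.19]]
poisoned = [5.0, 5.0]
updates = honest + [poisoned]

plain_mean = [sum(column) / len(updates) for column in zip(*updates)]
print("plain mean:", plain_mean)   # dragged toward the poisoned update
print("krum:", krum(updates, num_byzantine=1))
```

An adversarial robustness test for the aggregator is then a matter of injecting updates like `poisoned` and asserting that the selected update stays within the honest cluster.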
The progression from Level 0 to Level 4 is genuinely exciting. The capability being demonstrated across autonomous vehicles, humanoid platforms, and industrial systems is real and meaningful. What the industry needs now is a testing philosophy that matures at the same pace.
That means treating safety validation as a first-class design constraint, not a final checkpoint. It means building HAZOP and risk priority matrix analysis into the software development process from the start, not pulling out an FMEA spreadsheet before launch. It means defining what constitutes adequate coverage for a self-supervised or RL-trained system before deploying it, not after the first incident.
And it means giving standards bodies the practitioner feedback they need to evolve ISO 26262, ISO 21434, and the emerging bipedal robot standards faster than the technology is outpacing them.
The robots are getting smarter faster than the validation frameworks designed to certify them. Closing that gap isn’t a regulatory problem or a research problem in isolation. It’s an engineering culture problem. It gets solved when testing is treated as a first-class design discipline from day one, not a final gate before launch.
For those working on autonomous systems at any of these levels: at what point does the complexity of the system make traditional test-case enumeration genuinely obsolete, and what have you found actually replaces it? I’d especially like to hear from anyone navigating Level 3 or Level 4 validation in production.
Atharv Kolhar is a staff test automation engineer at Figure AI. There, he works on hardware-in-the-loop test infrastructure for the Figure 03 humanoid robot and the testing of critical robotic software. With a career across humanoid robotics, autonomous lidar sensing, and electric vehicles, he specializes in verification for safety-critical autonomous systems, previously building the software test discipline behind Aeva Technologies’ ASPICE Level 2 certification, with earlier roles at Lucid Motors and NIO.
Kolhar is a voting member of IEEE P2817, the working group writing the international standard for autonomous systems verification, a committee member of ASTM F45.06 on legged robot systems, and a peer reviewer for IEEE IROS and IEEE Transactions on Automation Science and Engineering.
The views expressed in this article are solely his own and do not represent the position, opinion, or stance of his employer or any affiliated organization.
Editor’s notes: This article draws on two published papers: “Standardizing Robot Control Levels: A Framework for Autonomous Operation, Real-Time Navigation, and Federated Reinforcement Learning” (IJRCAR, Vol. 14, Issue 3, March 2026) and “Enhancing Software DFMEA Processes through ISO 26262 and ISO 21434: Addressing RPN Limitations with Risk Priority Matrix and HAZOP Integration” (IRE Journals, Vol. 8, Issue 7, 2025).