KubeCon EU 2026 had plenty of the usual cloud-native chest-thumping. But one of the more honest themes buzzing around the conference floor was this: AI “agents” are getting shoved into production, and companies are tired of trusting vibes.
Solo.io showed up with a new tool called Agentevals, pitched as a way to measure whether agent workflows are actually any good—consistent, efficient, and worth the money—using telemetry data (traces, metrics, logs) plus agent-specific scoring.
Because here’s the dirty secret product teams and SREs already know: an agent can look brilliant on Tuesday and turn into a confused intern on Wednesday. Change a prompt. Swap a model. Hit a slower third-party API. Suddenly your “automation” is burning more compute, taking longer, and making worse decisions. And in multi-agent chains, those little failures don’t stay little—they cascade.
AI agents don’t fail like microservices—and that’s the problem
Traditional observability is great at telling you a service is slow, a pod is crashing, or an API is throwing 500s. It’s a lot worse at answering the question your boss actually cares about: did the agent make the right call, and how much did it cost us to get there?
Agent systems aren’t just request/response. They plan, call tools, fetch documents, loop, verify, and sometimes hand off to other agents. That “reasoning” is distributed across components and vendors, stitched together with frameworks that don’t share a common language.
So when something breaks—or just quietly gets worse—teams end up debugging by superstition: tweak prompts, rerun tests, hope the graphs look better.
What Solo.io says Agentevals does: score the workflow, not the hype
Solo.io’s pitch is that Agentevals fills the gap between raw technical signals and business reality. The company says it combines existing telemetry with proprietary agent-focused metrics to evaluate things like:
- output quality (did it answer correctly?)
- instruction compliance (did it follow the rules?)
- task success rate (did it actually complete the job?)
- execution efficiency (how long, how many steps, how many tokens, how many tool calls?)
The key promise isn’t just “visibility.” It’s comparability: the ability to put two versions of an agent workflow side-by-side and see whether you improved it—or introduced a regression.
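Solo.io hasn’t published the scoring internals, but the shape of that side-by-side comparison is easy to imagine. Here’s a rough sketch in Python; the metric names, thresholds, and compare logic are illustrative assumptions, not Agentevals’ actual API:

```python
# Illustrative sketch only: Agentevals' real scoring API is not public.
# Metric names, thresholds, and the regression rules are assumptions.
from dataclasses import dataclass

@dataclass
class WorkflowScore:
    task_success_rate: float       # fraction of runs that completed the job
    instruction_compliance: float  # fraction of runs that followed the rules
    avg_latency_s: float           # wall-clock time per run
    avg_tokens: int                # token spend per run
    avg_tool_calls: float          # external tool invocations per run

def compare(baseline: WorkflowScore, candidate: WorkflowScore,
            max_regression: float = 0.02) -> list[str]:
    """Flag regressions before promoting a new agent workflow version."""
    findings = []
    if candidate.task_success_rate < baseline.task_success_rate - max_regression:
        findings.append("task success regressed")
    if candidate.instruction_compliance < baseline.instruction_compliance - max_regression:
        findings.append("instruction compliance regressed")
    if candidate.avg_tokens > baseline.avg_tokens * 1.2:
        findings.append("token cost up >20%")
    if candidate.avg_latency_s > baseline.avg_latency_s * 1.5:
        findings.append("latency up >50%")
    return findings

# Example: a prompt tweak improves accuracy but blows up the token bill.
v1 = WorkflowScore(0.91, 0.97, 8.4, 12_000, 3.1)
v2 = WorkflowScore(0.93, 0.96, 9.1, 19_500, 4.8)
print(compare(v1, v2))  # ['token cost up >20%']
```

The point isn’t the particular thresholds; it’s that “did we improve the workflow?” becomes a mechanical check instead of a vibe.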
That sounds basic, but it’s exactly what’s been missing for a lot of companies trying to graduate from prototype to production. Software teams know how to ship code with acceptance criteria. Agent teams, too often, ship “seems fine.”
Telemetry is the foundation—because nobody wants another observability stack
Solo.io is leaning hard on a practical point: enterprises already have telemetry pipelines and dashboards. They already collect traces, metrics, and logs in Kubernetes-heavy environments. Asking them to rip that out and start over is a non-starter.
Agentevals is positioned as an evaluation layer that rides on top of what’s already there, enriching the data with agent-specific events—things a normal microservice doesn’t emit, like:
- start/end of a planning step
- external tool invocation
- context/document retrieval
- response validation
- action execution (opening a ticket, changing a config, calling an API)
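The good news is that nothing exotic is needed to emit those events: the standard OpenTelemetry API already supports span events and attributes. A minimal sketch in Python, with made-up span and event names (this is not an Agentevals schema):

```python
# Minimal sketch using the standard OpenTelemetry Python API.
# Span/event names here are illustrative, not an Agentevals schema.
from opentelemetry import trace

tracer = trace.get_tracer("agent.workflow")

def run_agent_step(task: str) -> str:
    with tracer.start_as_current_span("agent.plan") as span:
        span.add_event("plan.start", {"task": task})
        plan = ["retrieve_docs", "call_ticket_api"]   # stand-in planner output
        span.add_event("plan.end", {"steps": len(plan)})

    with tracer.start_as_current_span("agent.tool_call") as span:
        span.set_attribute("tool.name", "ticket_api")
        span.add_event("tool.invocation", {"args": task})
        result = "ticket-4711"                        # stand-in tool result

    with tracer.start_as_current_span("agent.validate") as span:
        span.add_event("response.validated", {"ok": True})
    return result
```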
The make-or-break detail is granularity. If you can’t see the workflow step-by-step, you can’t audit it, tune it, or control costs. If you can, you can start treating agent behavior like any other production system: measurable, testable, and rollback-able.
This isn’t just an SRE toy—product, compliance, and security all want receipts
As agents move from “cool demo” to “thing that touches customers and systems,” the tolerance for gray areas drops fast. CIOs don’t love mystery meat automation that opens tickets, changes configurations, or talks to customers with no paper trail.
For SRE teams, the question gets blunt: how do you set an SLO for an agent? “Low latency” is meaningless if the agent is confidently wrong. “High accuracy” is cute until the token bill spikes and the workflow starts timing out.
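One plausible answer is a composite SLO that gates on quality, latency, and cost at the same time. A sketch, with made-up thresholds that are illustrations rather than recommendations:

```python
# Sketch of a composite agent SLO: quality, latency, and cost together.
# All thresholds are made-up illustrations, not recommendations.
def slo_ok(accuracy: float, p95_latency_s: float, tokens_per_task: float) -> bool:
    return (
        accuracy >= 0.90               # the right answer matters first
        and p95_latency_s <= 30.0      # but not at any speed
        and tokens_per_task <= 20_000  # and not at any price
    )
```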
Product teams want to know what’s actually delivering value. Maybe the agent reduces time-to-first-response but increases ticket reopen rates because it’s sloppy. Without measurement, you’re just arguing in meetings.
Compliance teams want auditability: what sources were consulted, what tools were called, in what order. Full explainability is a fantasy in most real deployments, but execution logs—structured well—can at least show what happened.
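“Structured well” can be as modest as one JSON line per step, emitted in order. A sketch; the field names are assumptions, not a standard:

```python
# Sketch of a structured execution-log record: one JSON line per agent step,
# in order, so auditors can replay what happened. Field names are assumptions.
import json, time

def log_step(run_id: str, step: int, kind: str, detail: dict) -> None:
    print(json.dumps({
        "run_id": run_id,
        "step": step,
        "ts": time.time(),
        "kind": kind,       # e.g. "retrieval", "tool_call", "validation"
        "detail": detail,
    }))

log_step("run-42", 1, "retrieval", {"source": "kb://pricing-faq"})
log_step("run-42", 2, "tool_call", {"tool": "ticket_api", "action": "open"})
```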
And security? Agents are magnets for prompt injection and other nasty inputs. An evaluation layer won’t replace security controls, but it can surface weird behavior patterns—like sudden spikes in tool calls or strange deviations in outputs—that deserve attention.
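Even a crude baseline check over per-run tool-call counts would catch the “sudden spike” case. A sketch of the idea, not a substitute for real security controls:

```python
# Crude anomaly check: flag runs whose tool-call count sits far above the
# recent baseline. This is the shape of the idea, not a security control.
from statistics import mean, stdev

def is_tool_call_spike(history: list[int], current: int, sigma: float = 3.0) -> bool:
    if len(history) < 10:
        return False  # not enough baseline data yet
    mu, sd = mean(history), stdev(history)
    return current > mu + sigma * max(sd, 1.0)

print(is_tool_call_spike([3, 4, 3, 5, 4, 3, 4, 5, 3, 4], 19))  # True
```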
Why Solo.io picked KubeCon to plant this flag
KubeCon is where cloud-native vendors go to declare, “This is now a real production category.” By launching Agentevals here, Solo.io is betting that agent evaluation becomes a standard extension of observability—right alongside traffic management and API security.
The market is clearly forming: big observability platforms are bolting on AI features, while specialist tools are popping up to track model calls and grade outputs. Solo.io—best known for infrastructure and networking chops in distributed systems—wants a piece of the “agents are workloads now” shift.
The smart angle is consolidation. Enterprises hate console sprawl. If agent evaluation plugs into existing telemetry and workflows, it has a fighting chance.
The risk is standardization—or the lack of it. Observability got easier when the industry agreed on conventions for traces and metrics. Agent workflows don’t have that shared vocabulary yet. If Agentevals can’t keep up with fast-changing agent frameworks and model stacks, it’ll get left behind.
But if it can turn agent evaluation into something teams can act on—rollback a workflow, swap a model, tighten guardrails, cap costs—then Solo.io isn’t selling another dashboard. It’s selling control. And right now, control is what companies are shopping for.