How evaluation benchmarks and key success factors change when the AI does the work and people supervise

I sat in a diligence call last month with a company that automates L1 customer support for D2C brands. Their pricing: Rs X per ticket resolved without human escalation. Not per seat. Per resolution.

We couldn’t build a financial model on it. Revenue was a function of how many tickets their customers received, which was a function of how much their customers sold. Diwali quarter, one account tripled. January, it dropped 55%. Same product. Same resolution rate. NRR was 145% in Q3 and 82% in Q1. Not because anyone churned. Because fewer people were returning kurtas in January.

For SaaS businesses, ARR used to be the gospel. Monthly cohorts behaved. NRR told you whether customers loved the product. Those metrics worked because the product had a stable unit of consumption: a human logging in, doing work, generating activity data. When the AI is the primary worker and a human checks a dashboard once a week, the entire metrics architecture collapses.

Here’s what specifically breaks.

DAU and login frequency. Low DAU used to mean low engagement. In an AI-native product, the AI resolves 400 tickets overnight. The ops manager scans exceptions at 10 AM and closes the tab. The product works hardest when nobody is logged in. Kyle Poyar at Growth Unhinged flagged this: classic usage metrics were built for seat-based models where people were the users.

Track token consumption instead. Growing API usage after 90 days means the product is embedding into workflows. Flat means the customer bought a demo. The shape of the curve matters more than the absolute number. A customer burning 2 million tokens and growing 15% month-over-month is healthier than one burning 10 million flat.
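
To make the shape-versus-size point concrete, here is a minimal sketch in Python. The function names are ours and purely illustrative, and the second account is assumed to stay exactly flat. It estimates how many months a smaller, compounding account needs to overtake a larger one, and backs a monthly growth rate out of a usage series.

```python
import math


def months_to_overtake(small: float, growth: float, large: float) -> int:
    """Whole months until `small`, compounding at `growth` per month,
    reaches `large`, which is assumed to stay flat."""
    if growth <= 0:
        raise ValueError("growth must be positive")
    if small >= large:
        return 0
    return math.ceil(math.log(large / small) / math.log(1 + growth))


def monthly_growth_rate(usage: list[float]) -> float:
    """Compound monthly growth rate between the first and last observation."""
    if len(usage) < 2 or usage[0] <= 0:
        raise ValueError("need at least two observations and a positive start")
    periods = len(usage) - 1
    return (usage[-1] / usage[0]) ** (1 / periods) - 1


# The two accounts from the text: 2M tokens growing 15% a month vs. 10M flat.
print(months_to_overtake(2_000_000, 0.15, 10_000_000))  # 12
```

Under those assumptions the smaller account passes the larger one in about a year, which is why the curve matters more than the snapshot.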

NRR. Above 120% used to signal a compounding business. For outcome-priced AI products, NRR swings 50+ points quarter to quarter based on the customer’s business cycle. Optifai’s 2026 study of 939 companies made a sharp point: 100% NRR can mask 20% churn offset by 20% expansion. That’s a treadmill.
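
The treadmill is easy to see once net and gross retention sit side by side. The sketch below uses the standard definitions. The function name and the starting ARR of 100 are illustrative assumptions, not figures from the Optifai study.

```python
def retention(start_arr: float, churned: float,
              contraction: float, expansion: float) -> tuple[float, float]:
    """Return (NRR, GRR) for one cohort over one period.

    GRR ignores expansion, so it exposes the churn that NRR can hide.
    """
    if start_arr <= 0:
        raise ValueError("start_arr must be positive")
    retained = start_arr - churned - contraction
    grr = retained / start_arr
    nrr = (retained + expansion) / start_arr
    return nrr, grr


# Illustrative cohort: 20% of ARR churns, the survivors expand by the same amount.
nrr, grr = retention(start_arr=100.0, churned=20.0, contraction=0.0, expansion=20.0)
print(f"NRR {nrr:.0%}, GRR {grr:.0%}")
```

Same cohort, two very different stories: net retention reads 100% while gross retention reads 80%.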

Track share of eligible workflow instead. If a customer gets 10,000 tickets and the AI handles 6,000, that’s 60%. Next quarter the AI handles 7,500 out of 9,000. Volume dropped. Share went to 83%. The product earned more trust, and share shows it whichever way the customer’s volume, and NRR with it, swings that quarter.
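
The arithmetic behind that example, as a small Python sketch. The function name is ours. The ticket counts are the ones in the paragraph above.

```python
def workflow_share(handled: int, eligible: int) -> float:
    """Fraction of the eligible workflow the AI handled end to end."""
    if eligible <= 0:
        raise ValueError("eligible must be positive")
    if not 0 <= handled <= eligible:
        raise ValueError("handled must be between 0 and eligible")
    return handled / eligible


q1 = workflow_share(handled=6_000, eligible=10_000)
q2 = workflow_share(handled=7_500, eligible=9_000)

volume_change = 9_000 / 10_000 - 1  # change in the customer's ticket volume
print(f"share {q1:.0%} -> {q2:.0%}, eligible volume {volume_change:+.0%}")
```

Share normalises for the customer's business cycle, so it isolates what the product did from what the customer's sales did.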

Gross margin. Traditional SaaS runs at 75% to 85% because the marginal cost of serving one more customer is near zero. AI products are different. Every query incurs compute cost. Bessemer’s State of AI 2025 found that fast-scaling startups averaged 25% gross margins. Steadier growers managed 60%. Neither looks like SaaS. For the support company, simple queries ran at 80% margin. Complex troubleshooting ran at 30%. Same product. Same pricing. The headline margin was an average of things that shouldn’t be averaged.

Track margin per outcome tier. Segment by complexity. Ask whether the company is growing into high-margin work (margins expand) or winning enterprise accounts with harder problems (margins compress but retention gets stickier). Both are valid. The headline number hides which game is being played.
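
A rough sketch of why the headline number misleads. The 80% and 30% tier margins come from the support company above. The two ticket mixes are hypothetical, and the code assumes a flat price per resolution, so revenue share equals volume share.

```python
# Tier margins quoted in the text for the support company.
TIER_MARGIN = {"simple": 0.80, "complex": 0.30}


def blended_margin(tickets_by_tier: dict[str, int],
                   tier_margin: dict[str, float]) -> float:
    """Volume-weighted gross margin, assuming one flat price per outcome."""
    total = sum(tickets_by_tier.values())
    if total <= 0:
        raise ValueError("need at least one ticket")
    return sum(count / total * tier_margin[tier]
               for tier, count in tickets_by_tier.items())


# Hypothetical mixes: same product, same pricing, different customers.
d2c_heavy = {"simple": 7_000, "complex": 3_000}
enterprise_heavy = {"simple": 4_000, "complex": 6_000}

print(f"{blended_margin(d2c_heavy, TIER_MARGIN):.0%}")
print(f"{blended_margin(enterprise_heavy, TIER_MARGIN):.0%}")
```

Nothing about the product changed between the two runs. Only the mix did, and the blended margin moved by 15 points.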

The pricing problem nobody has solved 

Outcome-based pricing has two structural flaws. 

First, attribution. “Per ticket resolved” has a clear endpoint. For AI products doing lead scoring or content generation, proving the AI caused the outcome is genuinely hard. The further you move from a binary completion event, the weaker the attribution. 

Second, cost coverage. Seat-based pricing was boring, but it guaranteed the vendor’s minimum cost per customer was covered regardless of usage. Outcome-based pricing removes that floor. A customer who generates few outcomes still consumes onboarding, infrastructure, and support. The vendor absorbs that.

Token-based pricing is the more honest successor to seats. It is input-based, not outcome-based, but it tracks the vendor’s biggest cost line (compute) and passes it through with margin. Seats covered headcount cost in SaaS. Tokens cover compute cost in AI. The economic function is the same: a pricing floor tied to the vendor’s primary cost driver.

Hallucinations and the human oversight tax 

Most AI vendors today disclaim liability for output accuracy. Standard terms say the customer reviews everything before acting on it. That works when a human checks every output. It stops working in autonomous deployments.

Some companies are building SLA-like structures around accuracy rates, with service credits below threshold. But nobody has taken on real financial liability for hallucinations yet. The legal infrastructure doesn’t exist.

If the penalty for a wrong outcome is real (regulatory, financial, reputational), the AI solution needs human validators. QA reviewers. Escalation agents. Field engineers. Those are people with salaries. The cost stack becomes: inference + human review + error remediation + compliance overhead. That’s managed services economics, not software economics.

Track the human-to-automation ratio and its trajectory. At onboarding, perhaps 40% of outputs need human review. After six months, if the model has improved, maybe 15%. The rate at which human oversight decreases is a direct measure of product maturity. If the ratio isn’t improving, the company is stuck in managed services margins permanently, regardless of what the pitch deck says.
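
Here is one way to sketch how the review rate flows into margin. Every rupee figure below is a made-up placeholder, not data from any company. Only the 40% and 15% review rates come from the paragraph above, and the cost stack follows the four terms listed earlier.

```python
from dataclasses import dataclass


@dataclass
class CostStack:
    """Per-outcome costs. All values are hypothetical placeholders."""
    price: float = 100.0          # revenue per outcome
    inference: float = 15.0       # compute per outcome
    review_cost: float = 60.0     # cost of one human review
    remediation: float = 3.0      # average error clean-up per outcome
    compliance: float = 2.0       # compliance overhead per outcome

    def gross_margin(self, review_rate: float) -> float:
        """Gross margin when `review_rate` of outputs need a human look."""
        if not 0.0 <= review_rate <= 1.0:
            raise ValueError("review_rate must be between 0 and 1")
        cost = (self.inference + review_rate * self.review_cost
                + self.remediation + self.compliance)
        return (self.price - cost) / self.price


stack = CostStack()
for label, rate in [("onboarding", 0.40), ("month six", 0.15)]:
    print(f"{label}: review rate {rate:.0%}, margin {stack.gross_margin(rate):.0%}")
```

With these placeholder costs, the drop in review rate is the main lever on margin. If the rate stays flat, so does the margin, whatever the model benchmarks say.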

What is the output acceptance rate?

When the AI produces an output, does the customer accept it as-is, edit it heavily, or discard it? A company with 90% acceptance is building trust. A company with 50% is running an expensive triage layer. This metric connects directly to the cost questions: low acceptance means more human review hours, which means higher delivery costs, which means lower margins. Acceptance rate is the leading indicator that flows through the entire economic model.
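
Measuring this only takes a disposition tag on each output. A minimal sketch, assuming the vendor logs one of three labels per output. The label names and the sample log are hypothetical.

```python
from collections import Counter

DISPOSITIONS = ("accepted", "edited", "discarded")  # hypothetical label set


def acceptance_breakdown(log: list[str]) -> dict[str, float]:
    """Share of outputs in each disposition. 'accepted' means used as-is."""
    if not log:
        raise ValueError("log is empty")
    unknown = set(log) - set(DISPOSITIONS)
    if unknown:
        raise ValueError(f"unknown dispositions: {sorted(unknown)}")
    counts = Counter(log)
    return {d: counts[d] / len(log) for d in DISPOSITIONS}


# Hypothetical month of outputs for one customer.
sample_log = ["accepted"] * 90 + ["edited"] * 6 + ["discarded"] * 4
print(acceptance_breakdown(sample_log))
```

Tracking the edited bucket separately is a deliberate choice: heavy edits cost review hours even when the output is not thrown away.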

What we ask now at Pentathlon

At Pentathlon, AI diligence runs on six questions: 

(1) Is token consumption growing after 90 days? 

(2) What share of the eligible workflow does the AI handle, and is that share growing? 

(3) What is gross profit per outcome, segmented by complexity? 

(4) What is the output acceptance rate and how is it trending? 

(5) What share of ARR is committed vs. usage-projected vs. pilot revenue? 

(6) What does the human-in-the-loop cost look like, and is the human-to-automation ratio improving?
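
Question (5) is the one that most often needs a spreadsheet, so here is a minimal sketch of the split. The three bucket names mirror the question. The amounts are hypothetical.

```python
def arr_composition(arr_by_type: dict[str, float]) -> dict[str, float]:
    """Share of reported ARR in each revenue-quality bucket."""
    total = sum(arr_by_type.values())
    if total <= 0:
        raise ValueError("total ARR must be positive")
    return {kind: amount / total for kind, amount in arr_by_type.items()}


# Hypothetical reported ARR, split by how firm the revenue actually is.
reported = {"committed": 600_000, "usage_projected": 300_000, "pilot": 100_000}

for kind, share in arr_composition(reported).items():
    print(f"{kind}: {share:.0%}")
```

In this made-up case, only 60% of the headline ARR is contractually committed. The rest is a forecast.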

Should you keep using the old metrics?

The old metrics work for subscription businesses with predictable retention and 80% margins. The mistake is assuming they transfer to AI-native businesses where revenue follows the customer’s volume, margins swing by query complexity, liability for errors is unresolved, and the product works hardest when nobody is logged in. 

The reality is messier than any framework captures. We haven’t figured out liability. We haven’t figured out attribution. We haven’t figured out how to price human oversight into a product that’s supposed to eliminate humans. Every evaluation framework we have seen pretends at least one of these problems doesn’t exist.