AI · SaaS · user research · 2026 · Engagement: ongoing · Live: production

Cheaper inference, faster reports.

User Evaluation runs AI research analysis for product teams at Shopify, Samsung, SAP and Tencent. Behind every customer interaction sits a pipeline — transcription, multimodal chat, automated reports — and at scale that pipeline carries a real bill. We work on the layer that makes it cheaper and faster without softening what it does.

38%

Headline: drop in per-job inference cost across the synthesis pipeline, measured against the same workload month-on-month with their evals held flat.

01 — The brief

Volume kept climbing. The bill noticed.

The platform processes hours of audio, video and text per session. Every step is an LLM or transcription call. Every call has a price tag and a tail. As traffic grew, two things stopped being negotiable: cost per job and tail latency on synthesis.

The team had a working pipeline and a long backlog. They didn't need a rewrite. They needed someone to do the unglamorous middle layer — measure every call, decide what to swap, and prove the swap didn't regress quality on their own benchmarks.

02 — What we did

Audit. Swap. Measure. Repeat.

We started with a full audit — every step in the pipeline, with cost and latency attached to each one. Most of the bill sat in two places. We worked on those.
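An audit like this can be sketched in a few lines. The example below is a minimal illustration, not the code we ran: the `CallRecord` fields, the step names and the nearest-rank p95 are all assumptions made for the sketch. It groups call records by pipeline step and ranks the steps by their share of the bill.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CallRecord:
    """One LLM or transcription call (hypothetical record shape)."""
    step: str
    cost_usd: float
    latency_s: float


def audit(records):
    """Return one row per pipeline step, most expensive step first."""
    by_step = defaultdict(list)
    for record in records:
        by_step[record.step].append(record)

    total_cost = sum(r.cost_usd for r in records)
    rows = []
    for step, calls in by_step.items():
        cost = sum(c.cost_usd for c in calls)
        latencies = sorted(c.latency_s for c in calls)
        # Nearest-rank p95: the smallest value covering 95% of the calls.
        p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
        rows.append({
            "step": step,
            "calls": len(calls),
            "cost_usd": cost,
            "share": cost / total_cost if total_cost else 0.0,
            "p95_latency_s": latencies[p95_index],
        })
    rows.sort(key=lambda row: row["cost_usd"], reverse=True)
    return rows


if __name__ == "__main__":
    sample = [
        CallRecord("synthesis", 0.60, 10.0),
        CallRecord("synthesis", 0.20, 30.0),
        CallRecord("transcription", 0.20, 5.0),
    ]
    for row in audit(sample):
        print(row)
```

Sorting by cost puts the steps worth working on at the top; the p95 column is there because the tail, not the median, is what users feel.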

Per-task models: Swapped models step by step where the output held up under their internal evals, keeping the heavy ones only where they earned their keep.
Batched transcription: Moved transcription to batch processing where the SLA allowed; same content, half the unit cost.
Prompt compression: Tightened the heavy synthesis prompts and moved retrieval into a cache for repeated lookups across a session.
Eval harness: Built a regression harness so every change ran against their internal quality bar before it shipped near production.
Observability: Per-call cost and latency telemetry the team could read at a glance, so the gains don't quietly drift back.
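The regression check at the centre of the eval harness comes down to one comparison. The sketch below shows the idea under stated assumptions: the metric names, the scores and the `tolerance` parameter are hypothetical, and the real harness ran against the team's own internal evals rather than anything shown here.

```python
def regression_gate(baseline, candidate, tolerance=0.0):
    """Compare candidate eval scores with the baseline (higher is better).

    Returns a dict of failing metrics mapped to (baseline, candidate).
    An empty dict means the change is safe to ship on these metrics.
    """
    failures = {}
    for metric, base_score in baseline.items():
        cand_score = candidate.get(metric)
        # A missing metric counts as a failure: an unmeasured swap is unproven.
        if cand_score is None or cand_score < base_score - tolerance:
            failures[metric] = (base_score, cand_score)
    return failures


if __name__ == "__main__":
    # Hypothetical scores for a cheaper model on one synthesis step.
    baseline = {"summary_faithfulness": 0.91, "theme_recall": 0.84}
    candidate = {"summary_faithfulness": 0.92, "theme_recall": 0.80}
    failed = regression_gate(baseline, candidate)
    if failed:
        print("Blocked:", failed)
    else:
        print("Safe to ship")
```

Treating a missing score as a failure is a deliberate choice in this sketch: a swap only counts once it has been measured on every metric the baseline covers.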

OpenAI · Anthropic · Whisper · Postgres · their existing infra · neu eval harness

03 — Result

Lower bill. Same answers. Faster.

38% drop in per-job inference cost on the synthesis pipeline.

2.1× faster median end-to-end synthesis at the new model mix.

0% regression on their internal evals across the optimised paths.

The work continues. Each new model release opens another small swap; the eval harness keeps the floor honest.

“Boring work, important work. Measure every call, decide what's worth swapping, prove it didn't get worse. The bill went down and our customers haven't noticed, which is the highest compliment.”
CTO · userevaluation.com

Got an AI pipeline that's leaking money?

Book a 30-min call: tash@neu.ie