AI · SaaS · user research · 2026 · Engagement: ongoing · Live: production

Cheaper inference, faster reports.

User Evaluation runs AI research analysis for product teams at Shopify, Samsung, SAP and Tencent. Behind every customer interaction sits a pipeline — transcription, multimodal chat, automated reports — and at scale that pipeline carries a real bill. We work on the layer that makes it cheaper and faster without softening what it does.

38%

Headline: drop in per-job inference cost across the synthesis pipeline, measured month-on-month against the same workload, with scores on their internal evals held flat.

01 — The brief

Volume kept climbing. The bill noticed.

The platform processes hours of audio, video and text per session. Every step is an LLM or transcription call. Every call has a price tag and a latency tail. As traffic grew, two things stopped being negotiable: cost per job and tail latency on synthesis.

The team had a working pipeline and a long backlog. They didn't need a rewrite. They needed someone to take on the unglamorous middle layer: measure every call, decide what to swap, and prove the swap didn't regress quality on their own benchmarks.

02 — What we did

Audit. Swap. Measure. Repeat.

We started with a full audit — every step in the pipeline, with cost and latency attached to each one. Most of the bill sat in two places. We worked on those.
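A minimal sketch of that audit step, assuming each pipeline call is logged as a record with a step name, a cost in USD and a latency in seconds. The `CallRecord` shape and the step names are hypothetical, not the team's schema. It totals cost per step, reports each step's share of the bill and its p95 latency, and ranks steps so the biggest line items surface first.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CallRecord:
    # Hypothetical log shape: one row per LLM or transcription call.
    step: str
    cost_usd: float
    latency_s: float


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def audit(records: list[CallRecord]) -> list[dict]:
    """Total cost, share of the bill and p95 latency per step, most expensive first."""
    by_step: dict[str, list[CallRecord]] = defaultdict(list)
    for r in records:
        by_step[r.step].append(r)

    grand_total = sum(r.cost_usd for r in records)
    rows = []
    for step, calls in by_step.items():
        total = sum(c.cost_usd for c in calls)
        rows.append({
            "step": step,
            "calls": len(calls),
            "total_usd": total,
            "share": total / grand_total if grand_total else 0.0,
            "p95_latency_s": p95([c.latency_s for c in calls]),
        })
    # Biggest contributors to the bill first.
    rows.sort(key=lambda row: row["total_usd"], reverse=True)
    return rows


if __name__ == "__main__":
    sample = [
        CallRecord("transcribe", 0.40, 12.0),
        CallRecord("transcribe", 0.35, 10.5),
        CallRecord("synthesise", 0.90, 30.0),
        CallRecord("chat", 0.05, 1.2),
    ]
    for row in audit(sample):
        print(f"{row['step']:<12} {row['share']:.0%} of bill, p95 {row['p95_latency_s']}s")
```

Nearest-rank p95 keeps the sketch dependency-free; a real audit would more likely read from whatever telemetry store the pipeline already writes to.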

  • Per-task models: swapped models step by step where the output held up under their internal evals, keeping the heavy ones only where they earned their keep.
  • Batched transcription: moved transcription to batch processing where the SLA allowed; same content, half the unit cost.
  • Prompt compression: tightened the heavy synthesis prompts and moved retrieval into a cache for repeated lookups across a session.
  • Eval harness: built a regression harness so every change ran against their internal quality bar before it went anywhere near production.
  • Observability: per-call cost and latency telemetry the team can read at a glance, so the gains don't quietly drift back.
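The batched-transcription change comes down to one routing decision per job. A minimal sketch, assuming each job carries a submission time and an SLA, and assuming a fixed worst-case turnaround for the batch path. The 24-hour turnaround, the safety margin and every name here are hypothetical, not the team's settings.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class Route(Enum):
    BATCH = "batch"
    REALTIME = "realtime"


@dataclass
class TranscriptionJob:
    job_id: str
    submitted_at: datetime
    sla: timedelta  # how long the customer is promised to wait for the transcript


# Hypothetical worst-case turnaround for the batch path, plus a safety margin.
BATCH_TURNAROUND = timedelta(hours=24)
SAFETY_MARGIN = timedelta(hours=1)


def choose_route(job: TranscriptionJob, now: datetime) -> Route:
    """Use the batch path only if its worst-case turnaround still meets the SLA."""
    deadline = job.submitted_at + job.sla
    if now + BATCH_TURNAROUND + SAFETY_MARGIN <= deadline:
        return Route.BATCH
    return Route.REALTIME


if __name__ == "__main__":
    now = datetime(2026, 1, 5, 9, 0, tzinfo=timezone.utc)
    overnight = TranscriptionJob("research-batch", now, sla=timedelta(hours=48))
    live = TranscriptionJob("live-session", now, sla=timedelta(minutes=30))
    print(choose_route(overnight, now).value)  # batch
    print(choose_route(live, now).value)       # realtime
```

Deciding per job rather than per customer lets urgent work keep the realtime path while anything with slack drops to the cheaper one.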
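The session-scoped retrieval cache can be sketched as memoisation keyed by session and a normalised query. `fetch` stands in for the pipeline's real retrieval call; the class name and the normalisation rules are hypothetical illustrations, not the team's code.

```python
from typing import Callable


class SessionRetrievalCache:
    """Memoise retrieval results per session so repeated lookups skip the retriever."""

    def __init__(self, fetch: Callable[[str], list[str]]):
        self._fetch = fetch  # hypothetical stand-in for the real retrieval call
        self._store: dict[tuple[str, str], list[str]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _normalise(query: str) -> str:
        # Fold case, collapse whitespace and drop a trailing "?" so
        # near-identical queries share one cache entry.
        return " ".join(query.lower().split()).rstrip("?")

    def get(self, session_id: str, query: str) -> list[str]:
        key = (session_id, self._normalise(query))
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._fetch(query)
        return self._store[key]

    def end_session(self, session_id: str) -> None:
        """Drop a finished session's entries so the cache does not grow without bound."""
        for key in [k for k in self._store if k[0] == session_id]:
            del self._store[key]


if __name__ == "__main__":
    cache = SessionRetrievalCache(lambda q: [f"doc for {q}"])
    cache.get("s1", "Pricing?")
    cache.get("s1", "pricing")  # same normalised query: served from the cache
    print(cache.hits, cache.misses)  # 1 1
```

Scoping keys by session keeps one customer's lookups from ever being served into another session, at the cost of re-fetching when the same query appears in a new session.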

OpenAI · Anthropic · Whisper · Postgres · their existing infra · neu eval harness

03 — Result

Lower bill. Same answers. Faster.

38% drop in per-job inference cost on the synthesis pipeline.
2.1× faster median end-to-end synthesis at the new model mix.
0% regression on their internal evals across the optimised paths.

The work continues. Each new model release opens another small swap; the eval harness keeps the floor honest.
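The swap loop described here, where a new release replaces the current model for a task only if it is cheaper and does not regress, can be sketched as a gate against the current model's eval scores. Model names, metric names and numbers below are illustrative placeholders, not the team's models or results; the default `tolerance=0.0` mirrors a zero-regression bar.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    model: str                # hypothetical model identifier
    cost_per_job: float       # measured on the same workload as the current model
    scores: dict[str, float]  # eval metric -> score, higher is better


def passes_floor(candidate: Candidate, baseline: Candidate, tolerance: float = 0.0) -> bool:
    """True if the candidate is within `tolerance` of the baseline on every metric.

    A metric missing from the candidate's scores counts as a failure.
    """
    return all(
        candidate.scores.get(metric, float("-inf")) >= score - tolerance
        for metric, score in baseline.scores.items()
    )


def pick_model(baseline: Candidate, candidates: list[Candidate], tolerance: float = 0.0) -> Candidate:
    """Cheapest candidate that clears the eval floor and beats the current cost."""
    eligible = [
        c for c in candidates
        if passes_floor(c, baseline, tolerance) and c.cost_per_job < baseline.cost_per_job
    ]
    # If nothing qualifies, keep the current model.
    return min(eligible, key=lambda c: c.cost_per_job, default=baseline)


if __name__ == "__main__":
    current = Candidate("large-model", 0.50, {"faithfulness": 0.92, "coverage": 0.88})
    releases = [
        Candidate("small-model", 0.10, {"faithfulness": 0.90, "coverage": 0.89}),
        Candidate("mid-model", 0.25, {"faithfulness": 0.93, "coverage": 0.88}),
    ]
    print(pick_model(current, releases).model)  # mid-model
```

The cheapest release loses here because it slips on one metric; gating on every metric, rather than an average, is what stops a gain in one place from hiding a regression in another.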

Boring work, important work. Measure every call, decide what's worth swapping, prove it didn't get worse. The bill went down and our customers haven't noticed — which is the highest compliment.

CTO · userevaluation.com


Got an AI pipeline that's leaking money?