Harness engineering

The scaffolding that makes agents reliable.

Evals, orchestration, observability, and guardrails. The infrastructure you wish your AI team had already built. We install it, ship the first production system on top, and hand you the runbook.

Retainer from $24k/mo · project from $120k · diagnostic from $18k

Executive summary

The difference between a demo and a system your business runs on.

A harness is the scaffolding around an AI agent — the evals that say "ship it," the observability that catches drift, the orchestration that keeps retries from melting your budget. Most AI projects skip this. Most AI projects also get quietly shelved six months later. We build the harness first, so the agents you put in production stay there.

99.6%

agent uptime after a harness rebuild

median, last 8 engagements

-58%

inference cost vs. pre-engagement

same task volume, same quality bar

4.2×

faster to catch regressions

time-to-detect, per eval suite

0

production agents we've run without evals

we won't ship blind

What we build

Six layers of the stack.

Each engagement starts with whichever layer is weakest. Most teams ask for evals and discover they also need orchestration.

Evaluation suites

Graded datasets, LLM-as-judge with calibrated rubrics, regression tests that run on every commit and every model upgrade.

Observability

Span-level tracing for every tool call, prompt, and retry. Cost, latency, and quality surfaced on one dashboard your on-call engineer can actually read.
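
To make "span-level tracing" concrete, here is a minimal sketch in plain Python. The `Tracer` and `Span` classes, the `tool.search` span name, and the per-call cost figure are all hypothetical and exist only for illustration; a production setup would more likely export spans to a tracing backend than keep them in memory.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One traced unit of work: a tool call, a prompt, or a retry."""
    name: str
    attrs: dict = field(default_factory=dict)
    duration_s: float = 0.0
    error: Optional[str] = None


class Tracer:
    """Collects spans in memory (illustrative only, no exporter)."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def span(self, name, **attrs):
        s = Span(name=name, attrs=dict(attrs))
        start = time.perf_counter()
        try:
            yield s
        except Exception as exc:
            s.error = repr(exc)  # record the failure, then re-raise
            raise
        finally:
            s.duration_s = time.perf_counter() - start
            self.spans.append(s)


tracer = Tracer()


def call_tool(query):
    """Stand-in for a real tool call, wrapped in a span."""
    with tracer.span("tool.search", query=query) as s:
        result = [query.upper()]
        s.attrs["cost_usd"] = 0.0004  # hypothetical per-call cost
        return result
```

Because latency, cost, and errors land on the same record, a dashboard can aggregate all three from one stream of spans.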

Orchestration

Deterministic control flow around non-deterministic models. Retry strategies, fallback chains, circuit breakers, and budget caps.
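
One way to picture retries, fallback chains, and budget caps working together is the sketch below. It is an illustration under stated assumptions, not our delivered orchestrator: `Budget`, `run_with_fallbacks`, and the per-call cost figures are hypothetical names and numbers, and a circuit breaker is left out to keep it short.

```python
import time


class BudgetExceeded(Exception):
    """Raised when the next call would push spend past the cap."""


class Budget:
    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, amount_usd):
        if self.spent_usd + amount_usd > self.limit_usd:
            raise BudgetExceeded(f"cap of ${self.limit_usd:.2f} reached")
        self.spent_usd += amount_usd


def run_with_fallbacks(models, prompt, budget, max_attempts=3,
                       base_delay_s=0.5, sleep=time.sleep):
    """Try each (call, cost_usd) pair in order.

    Transient failures (modelled here as TimeoutError) are retried with
    exponential backoff; once a model exhausts its attempts, the next
    model in the chain is tried. Every attempt is charged to the budget
    before the call is made, so retries cannot overspend silently.
    """
    last_error = None
    for call, cost_usd in models:
        for attempt in range(max_attempts):
            budget.charge(cost_usd)
            try:
                return call(prompt)
            except TimeoutError as exc:
                last_error = exc
                sleep(base_delay_s * (2 ** attempt))
    raise RuntimeError("all models failed") from last_error
```

The control flow is fully deterministic: given the same sequence of failures, the same calls happen in the same order, which is what makes the behavior testable.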

Guardrails

Input validation, output schemas, PII redaction, and policy enforcement — layered so no single failure mode reaches the customer.
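
As a rough illustration of layering, the sketch below chains an input check, an output schema check, and a redaction pass. Everything named here is an assumption made for the example: the `ALLOWED_ROUTES` set, the `route` and `narrative` fields, and the email-only redaction pattern. Real PII handling covers far more than email addresses.

```python
import json
import re

# Hypothetical policy values, for illustration only.
ALLOWED_ROUTES = {"billing", "outage", "security"}
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class GuardrailError(ValueError):
    """Raised when input or output violates a guardrail."""


def check_input(text, max_chars=4000):
    """Layer 1: reject empty or oversized input before it reaches the model."""
    if not text.strip():
        raise GuardrailError("empty input")
    if len(text) > max_chars:
        raise GuardrailError("input too long")
    return text


def redact_pii(text):
    """Layer 3: mask email addresses (a deliberately narrow example)."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)


def check_output(raw):
    """Layer 2: parse the model output and enforce a small schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise GuardrailError("output is not valid JSON") from exc
    if not isinstance(data, dict):
        raise GuardrailError("output must be a JSON object")
    route = data.get("route")
    if not isinstance(route, str) or route not in ALLOWED_ROUTES:
        raise GuardrailError("route is missing or not allowed")
    if not isinstance(data.get("narrative"), str):
        raise GuardrailError("narrative must be a string")
    data["narrative"] = redact_pii(data["narrative"])
    return data
```

Each layer fails closed on its own, so a malformed output that slips past one check still has to pass the next before anything reaches the customer.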

Prompt & tool versioning

Every prompt, every tool schema, every eval run is versioned, diffable, and rollback-able. Deploys feel like deploys, not prayers.
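
A minimal sketch of what "versioned, diffable, and rollback-able" can mean for prompts, using only the standard library. The `PromptRegistry` class and its methods are hypothetical; in practice the same idea is often backed by git or a database rather than an in-memory dict.

```python
import difflib
import hashlib


class PromptRegistry:
    """In-memory prompt store keyed by content hash (illustrative only)."""

    def __init__(self):
        self._versions = {}  # name -> list of (digest, text), oldest first
        self._active = {}    # name -> index of the active version

    def publish(self, name, text):
        """Store a new version and make it active; return its digest."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        self._versions.setdefault(name, []).append((digest, text))
        self._active[name] = len(self._versions[name]) - 1
        return digest

    def active(self, name):
        """Return (digest, text) for the currently active version."""
        return self._versions[name][self._active[name]]

    def rollback(self, name):
        """Re-activate the previous version; return its digest."""
        if self._active[name] == 0:
            raise ValueError("no earlier version to roll back to")
        self._active[name] -= 1
        return self.active(name)[0]

    def diff(self, name, old_index, new_index):
        """Unified diff between two stored versions, as a list of lines."""
        old = self._versions[name][old_index][1].splitlines()
        new = self._versions[name][new_index][1].splitlines()
        return list(difflib.unified_diff(old, new, lineterm=""))
```

Hashing the content means two identical prompts always share a digest, so an eval run can be tied to the exact text it graded.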

Drift detection

When model behavior shifts — provider side, temperature side, your side — you know within an hour, not a quarter.
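
One simple way to detect a shift is to compare a rolling pass rate against a known baseline. The sketch below assumes each production sample has already been graded pass or fail; the `DriftMonitor` class, window size, and tolerance are hypothetical choices for illustration, and a real deployment would likely use a statistical test rather than a fixed margin.

```python
from collections import deque


class DriftMonitor:
    """Flags drift when the rolling pass rate falls below baseline - tolerance."""

    def __init__(self, baseline_rate, window=200, tolerance=0.05, min_samples=50):
        self.baseline_rate = baseline_rate
        self.tolerance = tolerance
        self.min_samples = min_samples
        self.results = deque(maxlen=window)  # oldest results fall off automatically

    def record(self, passed):
        """Add one graded sample and return the current drift verdict."""
        self.results.append(bool(passed))
        return self.drifted()

    def drifted(self):
        # Stay quiet until there is enough data to say anything.
        if len(self.results) < self.min_samples:
            return False
        rate = sum(self.results) / len(self.results)
        return (self.baseline_rate - rate) > self.tolerance
```

How quickly this fires depends on traffic: the alert can only trigger once `min_samples` graded results have arrived, so the window and sample floor should be sized to the volume the agent actually sees.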

Architecture

A typical APEX harness.

Shape changes per engagement. The layers don't.

Client surface

Your app, your users

Orchestrator

Routing, retries, fallbacks, budget

Guardrail layer

Input validation · output schemas · policy

Agent core

Prompts · tools · memory

Observability plane

Spans · costs · quality · drift alerts

Eval harness

Graded datasets · LLM judges · regression gates

→ built on your stack (Python / TS / Go)
→ model-agnostic (Anthropic · OpenAI · open-weight)
→ observability exports to Datadog · Grafana · Honeycomb

For engineers

An eval that runs on every commit.

A one-line step at the top of your PR check. Evals are graded against your labeled dataset, and any regression blocks the merge. No prompt reaches main without passing the same bar your production run held yesterday.

Everything we build is yours. No proprietary SDK lock-in — just Python (or Go, or TS) your team can read, extend, and audit.

Read the eval cookbook →

evals/triage.py

from apex import Suite, judge, gate

# Load 412 labeled production incidents
suite = Suite.from_jsonl("data/incidents.jsonl")

@suite.check(threshold=0.95)
def routes_correctly(case, agent_output):
    return agent_output.route == case.expected_route

@suite.check(threshold=0.99)
def no_hallucinated_systems(case, agent_output):
    return judge(
        agent_output.narrative,
        rubric="cites only systems in case.registry",
    )

# Fail the build if any check regresses
gate(suite.run(agent="triage@v7"))
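
For readers who want to see the gating idea without any library, here is a dependency-free sketch of a threshold gate. It is not the `apex` implementation; the `pass_rate`, `gate`, and `main` functions and the dict-shaped cases are assumptions made for illustration.

```python
def pass_rate(check, cases, outputs):
    """Fraction of (case, output) pairs for which the check returns truthy."""
    if not cases:
        raise ValueError("no cases to grade")
    results = [bool(check(c, o)) for c, o in zip(cases, outputs)]
    return sum(results) / len(results)


def gate(checks, cases, outputs):
    """Return failing checks as (name, rate, threshold) tuples.

    `checks` maps a name to a (check_function, threshold) pair.
    """
    failures = []
    for name, (check, threshold) in checks.items():
        rate = pass_rate(check, cases, outputs)
        if rate < threshold:
            failures.append((name, rate, threshold))
    return failures


def main(checks, cases, outputs):
    """Print failures and return a process exit code (0 = pass, 1 = fail)."""
    failures = gate(checks, cases, outputs)
    for name, rate, threshold in failures:
        print(f"FAIL {name}: {rate:.3f} < {threshold}")
    return 1 if failures else 0
```

In a CI job, passing the return value of `main(...)` to `sys.exit` is enough for most runners to mark the check as failed and block the merge.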

Signals

When to call us.

You have an agent in production and nobody trusts it

Silent regressions, customer-facing hallucinations, inference bills that climb without explanation.

Your evals are vibes

You ship prompts based on a handful of manual checks. You know this is bad. You don't have time to fix it.

You're about to go from one agent to five

You need orchestration, shared observability, and unified eval infra before the portfolio gets unmanageable.

A provider upgrade broke you last quarter

Model behavior shifted, evals didn't catch it, and you found out from a customer. A harness prevents the next one.

Get started

Two ways in.

Book a call or read our writing. We don't run sales cadences — if there's a fit we'll say so.

Read our writing first