The challenge
The insurer's claims team handled simple claims in days and complex claims in weeks. The bottleneck was triage — classifying incoming claims, requesting missing documents, and routing to the right adjuster. Their previous attempt at automation was a rules engine that covered roughly 38% of claim types; claims outside that coverage were routinely misclassified.
The brief was clear: an agentic system that could triage, request information, and route — but with human-in-the-loop on every payout decision and a SOC 2 Type II audit posture from day one.
The architecture
We chose LangGraph for orchestration, with durable state in Postgres, short-term memory in Redis, and S3-compatible object storage for claim artefacts. The agent's tool catalog was wrapped around the insurer's existing internal APIs — claim lookup, document classification, customer messaging — with typed JSON-schema interfaces, per-tool timeouts, retry budgets, and dead-letter queues for failed calls.
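The per-tool execution policy can be sketched as follows. This is an illustrative shape, not the production wrapper: the function names, the simplified type check standing in for full JSON-schema validation, and the in-process dead-letter queue are all assumptions.

```python
import queue
import time

# Calls that exhaust their retry budget are parked here for inspection
# (standing in for a real dead-letter queue).
dead_letter: "queue.Queue[dict]" = queue.Queue()

def call_tool(tool, args, schema, timeout_s=5.0, retry_budget=3):
    """Validate args against the tool's declared schema, then call with retries."""
    # Minimal structural validation in place of full JSON-schema checks.
    for field, expected in schema.items():
        if not isinstance(args.get(field), expected):
            raise TypeError(f"{field} must be {expected.__name__}")
    for attempt in range(retry_budget):
        start = time.monotonic()
        try:
            result = tool(**args)
            if time.monotonic() - start > timeout_s:
                raise TimeoutError("tool exceeded its timeout")
            return result
        except Exception as exc:
            if attempt == retry_budget - 1:
                # Retry budget exhausted: record the failure and re-raise.
                dead_letter.put({"args": args, "error": str(exc)})
                raise
```

Each internal API (claim lookup, document classification, customer messaging) sat behind a wrapper of this shape, so the agent never called an upstream system without a schema, a timeout, and a bounded retry budget in front of it.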
Tracing ran through Langfuse with extensions to capture the agent's plan, every tool call with its observation, and the cost of every model call. Every run was replayable: an operations engineer could walk back through any disposition and see exactly what the agent reasoned about, what it called, and what it observed.
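The replayable run record amounts to an append-only event log per run. The sketch below is a hedged illustration of that shape, not the Langfuse API: `TraceEvent`, `Run`, and their fields are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class TraceEvent:
    kind: str          # "plan", "tool_call", or "model_call"
    payload: dict      # plan text, tool args + observation, or model usage
    cost_usd: float = 0.0

@dataclass
class Run:
    run_id: str
    events: list = field(default_factory=list)

    def record(self, kind, payload, cost_usd=0.0):
        # Append-only: nothing is mutated after the fact.
        self.events.append(TraceEvent(kind, payload, cost_usd))

    def replay(self):
        """Walk back through the run in order, as an operations engineer would."""
        for i, ev in enumerate(self.events):
            yield i, ev.kind, ev.payload

    def total_cost(self):
        return sum(ev.cost_usd for ev in self.events)
```

Because every disposition is just a sequence of these events, "replay" is a read of the log, not a re-execution: the engineer sees the plan, each tool call with its observation, and the cost of each model call in the order they happened.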
HITL gates fired before any action that touched payment or external customer communication. The agent could classify, request documents, and route freely; it could not authorise a payout without an explicit operator approval.
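The gate itself is a small piece of dispatch logic. A minimal sketch, with hypothetical action names: anything touching money or the customer requires a named approver; everything else passes through.

```python
# Actions that touch payment or external customer communication are gated.
GATED_ACTIONS = {"authorise_payout", "send_customer_message"}

class ApprovalRequired(Exception):
    """Raised when a gated action is attempted without operator sign-off."""

def dispatch(action, args, approved_by=None):
    if action in GATED_ACTIONS and approved_by is None:
        # The action is parked until a named operator approves it.
        raise ApprovalRequired(f"{action} requires operator approval")
    return {"action": action, "args": args, "approved_by": approved_by}
```

The important property is that the gate lives in the dispatcher, not in the prompt: the model cannot talk its way past it, because the ungated path simply does not exist for those actions.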
The eval harness
The pre-deploy eval ran scenario-based tests across 180 archived claims spanning the full taxonomy of types. Each scenario had a known correct disposition and an expected tool-call sequence. LLM-as-judge scored plan quality (was the agent's reasoning sound?) and tool selection (did it pick the right tool at each step?). Regressions blocked deploy.
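The shape of one scenario and the deploy gate can be sketched like this. The field names are assumptions, and the LLM-as-judge step is stubbed out as an exact-match score rather than a real model call.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    claim_id: str
    expected_disposition: str   # the known correct outcome for this claim
    expected_tools: list        # the expected tool-call sequence

def score(scenario, actual_disposition, actual_tools):
    """Exact-match stand-in for the judged plan-quality and tool-selection scores."""
    disposition_ok = actual_disposition == scenario.expected_disposition
    tools_ok = actual_tools == scenario.expected_tools
    return disposition_ok and tools_ok

def gate_deploy(results, baseline_pass_rate):
    # Any regression against the recorded baseline blocks the deploy.
    pass_rate = sum(results) / len(results)
    return pass_rate >= baseline_pass_rate
```

In the real harness the binary match above is replaced by judged scores, but the gate works the same way: the 180-scenario pass rate is compared against the last accepted baseline, and a drop fails the pipeline.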
The compliance posture
SOC 2 Type II review ran in parallel with the build. The architecture was designed around the audit, not retrofitted: region-pinned model deployments, per-tenant audit logging, PII redaction in trace pipelines, and no-train guarantees on every external model call. Internal security review passed on first submission. SOC 2 audit completed two weeks after production launch.
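The redaction step in the trace pipeline is conceptually simple: scrub PII before anything is persisted to the trace store. A minimal sketch, where the two patterns below are illustrative and not the production rule set:

```python
import re

# Illustrative PII patterns; the production pipeline carries a fuller set.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text):
    """Replace each PII match with a labelled placeholder before tracing."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Running redaction in the pipeline, rather than trusting each call site, means a new tool or prompt cannot accidentally leak customer data into traces: everything passes through the same scrubber on its way to storage.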
The outcome
Time-to-disposition for simple claims fell from days to under one hour. Misclassification on the subset previously handled by the rules engine dropped to under 1.5%. The SOC 2 Type II audit passed on first submission with zero findings related to the AI system.
The pod transitioned to a steady-state engagement at week thirteen, supporting the agent's expansion into a second claim line and on-call coverage for the trace pipeline.