The Brief
Details of this engagement are under a non-disclosure agreement. What can be shared is the nature of the work and the technical direction, without identifying the specific product, features, or any proprietary architecture.
The client is a venture-backed AI company in active operation. The product involves character-driven conversation: AI systems that maintain consistent personality, memory, and tone across extended interactions. Holding all three steady over long conversations is technically harder than it looks.
The Approach
The central challenge in character AI is not making an AI that can adopt a personality. Modern LLMs do this adequately out of the box. The challenge is making an AI that maintains that personality consistently across:
- Multiple conversation sessions (memory and continuity)
- Edge cases and adversarial prompts (robustness)
- Different user interaction styles (adaptability without drift)
- High-concurrency production load (scale without degradation)
The work on this engagement addresses all four of these. I cannot describe the specific technical solutions, but the evaluation methodology — how we measure character consistency and identify regressions — is where the most original work has been done.
"The hardest problem in character AI is not generation. It is evaluation."
The Build
Real-time chat infrastructure at scale requires different architectural choices than batch LLM processing. Latency budgets are strict, streaming is non-negotiable, and partial failure (one message in a conversation fails) needs graceful handling that does not break the user experience.
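The engagement's own code cannot be shown, so the following is only a generic sketch of what graceful handling of a single failed message might look like in a streaming setup. It assumes a hypothetical `make_stream` factory that returns an async iterator of text chunks; `stream_reply`, `collect_reply`, `FALLBACK_TEXT`, and the timeout and retry values are all illustrative names and numbers, not anything from the client's system.

```python
import asyncio
from typing import AsyncIterator, Callable, List

# Hypothetical in-character fallback line shown when every attempt fails.
FALLBACK_TEXT = "Sorry, I lost my train of thought. Could you say that again?"


async def stream_reply(
    make_stream: Callable[[], AsyncIterator[str]],
    first_chunk_timeout_s: float = 2.0,
    max_attempts: int = 2,
) -> AsyncIterator[str]:
    """Yield chunks for one message, retrying only before anything is shown."""
    for _ in range(max_attempts):
        stream = make_stream()
        try:
            # Enforce the latency budget on the first chunk only.
            first = await asyncio.wait_for(stream.__anext__(), first_chunk_timeout_s)
        except Exception:
            # Nothing has reached the user yet, so retrying is invisible.
            continue
        yield first
        while True:
            try:
                chunk = await stream.__anext__()
            except Exception:
                # Normal end of stream (StopAsyncIteration) or a mid-stream
                # error: text is already on screen, so end the turn quietly.
                break
            yield chunk
        return
    # Every attempt failed before producing output: degrade to a fallback line.
    yield FALLBACK_TEXT


def collect_reply(make_stream: Callable[[], AsyncIterator[str]]) -> List[str]:
    """Synchronous helper for trying the sketch outside an event loop."""
    async def _run() -> List[str]:
        return [chunk async for chunk in stream_reply(make_stream)]
    return asyncio.run(_run())
```

The design choice worth noting is that retries happen only before the first chunk is delivered. Once the user has seen partial text, restarting the message would be more disruptive than ending it early.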
The evaluation pipeline is the layer I am most proud of. Evaluating qualitative properties (does this response feel like this character?) requires moving beyond simple automated metrics toward structured human-in-the-loop workflows combined with LLM-as-judge scoring. The real work is calibrating the automated signals against human judgment.
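One common way to quantify calibration between an LLM judge and human reviewers is chance-corrected agreement on a shared, labeled sample. The sketch below computes Cohen's kappa in plain Python. It is a standard statistic swapped in for illustration, not a description of the methodology used on this engagement, and the label values (`"in_character"`, `"out_of_character"`) are hypothetical.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(human: Sequence[str], judge: Sequence[str]) -> float:
    """Chance-corrected agreement between human labels and LLM-judge labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(human)
    # Fraction of items where the two raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical sample: four responses labeled by a reviewer and by the judge.
human_labels = ["in_character", "in_character", "out_of_character", "out_of_character"]
judge_labels = ["in_character", "in_character", "out_of_character", "in_character"]
kappa = cohens_kappa(human_labels, judge_labels)  # 0.5 for this sample
```

A kappa near 1 suggests the judge can stand in for human review on that label; a low value suggests the judge prompt or rubric needs work before its scores are trusted.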
Production observability for LLM systems is an unsolved problem at most companies. Standard APM tools do not capture what matters: token efficiency, prompt version performance, character consistency scores over time, latency distribution by model and prompt template. Building the right dashboards for LLM systems requires knowing what to measure, which is not obvious.
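As a rough illustration of the measurements listed above, the sketch below groups per-call records by model and prompt version and reports latency percentiles, token usage, and mean consistency score. The record fields, the `summarize` function, and the 0-to-1 `consistency_score` are assumptions made for the example, not the schema used in production.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class LLMCallRecord:
    model: str
    prompt_version: str
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    consistency_score: float  # hypothetical 0..1 score from the eval pipeline


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; assumes a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records: Iterable[LLMCallRecord]) -> Dict[Tuple[str, str], dict]:
    """Aggregate call records per (model, prompt_version) pair."""
    groups: Dict[Tuple[str, str], List[LLMCallRecord]] = defaultdict(list)
    for r in records:
        groups[(r.model, r.prompt_version)].append(r)
    summary = {}
    for key, rs in groups.items():
        latencies = [r.latency_ms for r in rs]
        summary[key] = {
            "calls": len(rs),
            "latency_p50_ms": percentile(latencies, 50),
            "latency_p95_ms": percentile(latencies, 95),
            "mean_total_tokens": mean(r.prompt_tokens + r.completion_tokens for r in rs),
            "mean_consistency": mean(r.consistency_score for r in rs),
        }
    return summary
```

Keying every metric by prompt version is the part standard APM tooling tends to lack: it lets a regression be traced to a specific prompt change rather than to the service as a whole.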
The Outcome
The engagement is active. Specific metrics are under NDA. The work is shipping to real users.
What I can say: the evaluation infrastructure built here applies beyond this engagement. The methodology for measuring LLM system quality in production — not just benchmark performance, but operational quality — is something I now bring to every new engagement.
Lessons
Character AI exposes a problem with standard prompt engineering practice: prompts that work in development often fail in production, because the distribution of real user messages is nothing like the distribution of test messages you wrote yourself.
The solution is a continuous evaluation loop: collect real production conversations, score them against your quality criteria, find the failure modes, and update the prompts. This loop needs to be fast — ideally same-day — or the product degrades faster than you can fix it.
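The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `Conversation`, `Verdict`, `run_eval_cycle`, and the 5% threshold are hypothetical, and the `score` callable stands in for whatever quality criteria (human review, LLM judge, or both) a team actually uses.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Conversation:
    conversation_id: str
    prompt_version: str
    transcript: str


@dataclass
class Verdict:
    passed: bool
    failure_mode: str = ""  # e.g. "broke_character"; empty when passed


def run_eval_cycle(
    conversations: Iterable[Conversation],
    score: Callable[[Conversation], Verdict],
    max_failure_rate: float = 0.05,
) -> List[Tuple[str, str, float]]:
    """Score one batch and return (prompt_version, failure_mode, rate) to fix."""
    totals: Counter = Counter()
    failures: Counter = Counter()
    for conv in conversations:
        totals[conv.prompt_version] += 1
        verdict = score(conv)
        if not verdict.passed:
            failures[(conv.prompt_version, verdict.failure_mode)] += 1
    flagged = []
    # Most frequent failure modes first, so the worst problems surface on top.
    for (version, mode), count in failures.most_common():
        rate = count / totals[version]
        if rate > max_failure_rate:
            flagged.append((version, mode, rate))
    return flagged
```

Run on each day's sample of production conversations, the returned list is effectively the prompt-update backlog: each entry names a prompt version and the failure mode that exceeded the tolerated rate.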
Ship the evaluation system before you need it. By the time you notice degradation in production, you are already behind.