Consumer AI

AI Character Platform

Active retainer with a venture-backed AI company building character and conversation products.

Role

Contract AI Engineer

Timeline

2025–Present

Stack

4 technologies

Confidentiality

Under NDA

Stack: LLMs, Real-time chat, Evaluation pipelines, Observability

The Brief

Details of this engagement are under a non-disclosure agreement. What can be shared is the nature of the work and the technical direction, without identifying the specific product, features, or any proprietary architecture.

The client is a venture-backed AI company — funded and operating. The product involves character-driven conversation: AI systems that maintain consistent personality, memory, and tone across extended interactions. This is technically harder than it looks.

The Approach

The central challenge in character AI is not getting a model to adopt a personality; modern LLMs do that adequately out of the box. The challenge is making an AI that maintains that personality consistently across:

The work on this engagement addresses all four of these. I cannot describe the specific technical solutions, but the evaluation methodology — how we measure character consistency and identify regressions — is where the most original work has been done.

"The hardest problem in character AI is not generation. It is evaluation."

The Build

Real-time chat infrastructure at scale requires different architectural choices than batch LLM processing. Latency budgets are strict, streaming is non-negotiable, and partial failure (one message in a conversation fails) needs graceful handling that does not break the user experience.
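
As a generic illustration of graceful partial-failure handling in a streaming reply (not the client's implementation, which is under NDA), here is a minimal sketch. It assumes the model reply arrives as an async iterator of text chunks. The names `stream_reply`, `make_stream`, `FALLBACK_TEXT`, and `TRUNCATION_MARKER`, and the one-second first-chunk budget, are all hypothetical.

```python
import asyncio
from typing import AsyncIterator, Callable

# Hypothetical user-facing strings; a real product would phrase these in character.
FALLBACK_TEXT = "[reply unavailable, please try again]"
TRUNCATION_MARKER = " [reply cut short]"


async def stream_reply(
    make_stream: Callable[[], AsyncIterator[str]],
    first_chunk_timeout_s: float = 1.0,
) -> AsyncIterator[str]:
    """Relay chunks from a model stream, degrading gracefully on failure."""
    emitted = False
    try:
        stream = make_stream().__aiter__()
        # Apply the latency budget to the first chunk only.
        first = await asyncio.wait_for(
            stream.__anext__(), timeout=first_chunk_timeout_s
        )
        emitted = True
        yield first
        async for chunk in stream:
            yield chunk
    except Exception:
        # Timeout or provider error: only this message fails, and the
        # consumer still receives a well-formed end to the stream.
        yield TRUNCATION_MARKER if emitted else FALLBACK_TEXT
```

The design choice worth noting is that the failure is contained to one message: the caller always sees a stream that ends cleanly, so the surrounding conversation state does not need special error paths.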

The evaluation pipeline is the layer I am most proud of. Evaluating qualitative properties (does this response feel like this character?) requires moving beyond simple automated metrics toward structured human-in-the-loop workflows combined with LLM-as-judge scoring. Calibrating the automated signals against human judgment is the real work.
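
One generic way to check calibration between an LLM judge and human reviewers is to have both label the same sample and measure chance-corrected agreement. The sketch below uses Cohen's kappa for that purpose. This is an assumed choice of statistic for illustration, not a description of the method used on this engagement, and the function name and labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("both raters must label the same non-empty sample")
    n = len(human)
    # Fraction of items where the two raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "pass", "fail", "pass"]
kappa = cohen_kappa(human_labels, judge_labels)  # 0.5 for this sample
```

A low kappa on a fresh sample is a signal to revisit the judge prompt or rubric before trusting its scores at scale.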

Production observability for LLM systems is an unsolved problem at most companies. Standard APM tools do not capture what matters: token efficiency, prompt version performance, character consistency scores over time, latency distribution by model and prompt template. Building the right dashboards for LLM systems requires knowing what to measure, which is not obvious.
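
To make the "latency distribution by model and prompt template" idea concrete, here is a minimal sketch that groups per-call records and reports nearest-rank percentiles and mean output tokens per group. The `CallRecord` fields and the `summarize` function are hypothetical names chosen for illustration; a real system would feed these numbers into whatever metrics backend is in use.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class CallRecord:
    # Hypothetical per-call log record.
    model: str
    prompt_version: str
    latency_ms: float
    output_tokens: int


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(
    records: Iterable[CallRecord],
) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Aggregate latency and token usage per (model, prompt_version)."""
    groups: Dict[Tuple[str, str], List[CallRecord]] = defaultdict(list)
    for record in records:
        groups[(record.model, record.prompt_version)].append(record)
    summary = {}
    for key, group in groups.items():
        latencies = [r.latency_ms for r in group]
        summary[key] = {
            "count": len(group),
            "p50_ms": percentile(latencies, 50),
            "p95_ms": percentile(latencies, 95),
            "mean_output_tokens": sum(r.output_tokens for r in group) / len(group),
        }
    return summary
```

Keying every metric by prompt version as well as model is what lets a regression be traced to a specific prompt change rather than to the model or the traffic mix.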

The Outcome

The engagement is active. Specific metrics are under NDA. The work is shipping to real users.

What I can say: the evaluation infrastructure built here applies beyond this engagement. The methodology for measuring LLM system quality in production — not just benchmark performance, but operational quality — is something I now bring to every new engagement.

Lessons

Character AI exposes a problem with standard prompt engineering practices: prompts that work in development fail in production because the distribution of real user messages is nothing like the distribution of test messages you wrote yourself.

The solution is a continuous evaluation loop: collect real production conversations, score them against your quality criteria, find the failure modes, and update the prompts. This loop needs to be fast — ideally same-day — or the product degrades faster than you can fix it.
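
The scoring and failure-finding steps of that loop can be sketched as below. This is a minimal illustration under stated assumptions: conversations are plain strings, the scorer is any callable returning a quality score in [0, 1] plus a failure tag, and the 0.7 threshold is arbitrary. `Scorer` and `find_failure_modes` are hypothetical names.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

# Hypothetical scorer interface: returns (quality score in [0, 1], failure tag).
Scorer = Callable[[str], Tuple[float, str]]


def find_failure_modes(
    conversations: Iterable[str],
    score: Scorer,
    threshold: float = 0.7,
) -> List[Tuple[str, int]]:
    """Score conversations and rank failure tags by how often they occur."""
    failure_tags: Counter = Counter()
    for conversation in conversations:
        quality, tag = score(conversation)
        if quality < threshold:
            failure_tags[tag] += 1
    # Most frequent failure modes come first.
    return failure_tags.most_common()
```

Ranking by frequency gives a simple priority order for which prompt to revise first in each pass of the loop.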

Ship the evaluation system before you need it. By the time you notice degradation in production, you are already behind.
