The Brief

Kandji is an enterprise Apple device management platform: the system IT administrators use to deploy, configure, and secure MacBooks and iPhones at scale. It is a deeply technical product used by companies that take endpoint security seriously.

The engagement brought LLM capabilities into the platform: automation features that understand natural language, a RAG system for the knowledge base, and the MLOps infrastructure to keep everything running reliably in production.

What Was Delivered

LLM automation features using a tool-calling architecture — the model orchestrates well-defined actions rather than generating free-form output
RAG pipeline for the Kandji knowledge base, built with a retrieval evaluation harness from day one
Vector database deployment on Pinecone after evaluation against pgvector and Weaviate
MLOps monitoring stack on AWS SageMaker with nightly evaluation jobs
Retraining pipeline that triggers automatically when answer relevance drops below threshold
Time to detect model quality degradation reduced from days to hours

The Approach

Enterprise AI features operate under different constraints than consumer AI features. The users are IT professionals: technically sophisticated, time-poor, and intolerant of hallucinations. They do not forgive AI that sounds confident but gives wrong answers.

This shaped every technical decision. RAG retrieval quality was the obsession, not generation quality. A language model generating fluent nonsense is worse than no AI feature at all in an enterprise tool.

"In enterprise AI, a confident wrong answer is worse than no answer."

The vector database selection came down to query performance at scale and the quality of metadata filtering. Kandji's knowledge base has fine-grained structure — device types, OS versions, deployment scenarios — and the retrieval system needed to respect that structure, not treat all documents as equivalent.
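To make the metadata point concrete, here is a minimal in-memory sketch of filtered retrieval. It is not the production Pinecone setup: the Chunk type, the document IDs, the metadata keys (device_type, os_version), and the two-dimensional embeddings are all hypothetical, chosen only to show the filter being applied before similarity ranking.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Chunk:
    doc_id: str
    embedding: list[float]
    metadata: dict[str, str] = field(default_factory=dict)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(chunks, query_embedding, filters, top_k=5):
    # Keep only chunks whose metadata matches every requested field,
    # then rank the survivors by similarity to the query.
    candidates = [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda c: cosine(c.embedding, query_embedding),
        reverse=True,
    )
    return [c.doc_id for c in ranked[:top_k]]


# Hypothetical index entries, for illustration only.
index = [
    Chunk("filevault-macos14", [0.9, 0.1], {"device_type": "mac", "os_version": "14"}),
    Chunk("filevault-macos13", [0.88, 0.12], {"device_type": "mac", "os_version": "13"}),
    Chunk("passcode-ios17", [0.2, 0.8], {"device_type": "iphone", "os_version": "17"}),
]
hits = filtered_search(index, [1.0, 0.0], {"device_type": "mac", "os_version": "14"})
```

Without the filter, the macOS 13 document scores almost as high as the macOS 14 one and could easily be returned for a macOS 14 question. With it, only documents for the right OS version are candidates at all.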

The Build

Automation features. These were built on a tool-calling architecture: the LLM orchestrates a set of well-defined tools (query device status, apply configuration profile, create enforcement rule) rather than generating free-form output. This is more reliable and more auditable, because every tool call the model requests can be logged, reviewed, and explained.
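The shape of that architecture can be sketched roughly as follows. This is an illustration under assumptions, not the shipped code: the ToolCall type, the TOOLS registry, the dispatch function, and the stubbed tool bodies are hypothetical, and a real system would receive the tool name and arguments from the model's structured output.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any]


# Hypothetical stubs; real tools would call the platform's own API.
def query_device_status(device_id: str) -> dict[str, Any]:
    return {"device_id": device_id, "status": "compliant"}


def apply_configuration_profile(device_id: str, profile: str) -> dict[str, Any]:
    return {"device_id": device_id, "profile": profile, "applied": True}


# The model can only trigger actions that are registered here.
TOOLS: dict[str, Callable[..., dict[str, Any]]] = {
    "query_device_status": query_device_status,
    "apply_configuration_profile": apply_configuration_profile,
}

audit_log: list[str] = []


def dispatch(call: ToolCall) -> dict[str, Any]:
    """Run a model-requested tool call, rejecting anything not in the registry."""
    if call.name not in TOOLS:
        result = {"error": f"unknown tool: {call.name}"}
    else:
        try:
            result = TOOLS[call.name](**call.arguments)
        except TypeError as exc:  # wrong or missing arguments from the model
            result = {"error": str(exc)}
    # Record every request and its outcome so it can be reviewed later.
    audit_log.append(
        json.dumps({"tool": call.name, "args": call.arguments, "result": result})
    )
    return result


ok = dispatch(ToolCall("query_device_status", {"device_id": "MAC-001"}))
rejected = dispatch(ToolCall("wipe_all_devices", {}))
```

The design choice worth noting is that the registry is an allow-list: a request for a tool that does not exist is refused and logged rather than improvised.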

RAG pipeline. Before the first embedding was generated, a test set of question/answer pairs was built, representative of real user queries and manually annotated with correct answers. Every subsequent change to the retrieval pipeline was measured against this benchmark. Semantic chunking (splitting at logical document boundaries rather than at a fixed character count) produced dramatically better retrieval than naive fixed-size splitting.
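Measuring each change against the benchmark can be as simple as comparing per-query scores from two runs. The sketch below assumes each run is a mapping from benchmark query to precision. The compare_runs helper, the tolerance value, the queries, and the scores are made up for illustration and are not results from this engagement.

```python
from statistics import mean


def compare_runs(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,
) -> dict:
    """Compare per-query precision between two pipeline versions on one test set."""
    if baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same benchmark queries")
    # Queries that got worse by more than the tolerance deserve a manual look,
    # even when the average improves.
    regressions = sorted(
        q for q in baseline if candidate[q] < baseline[q] - tolerance
    )
    delta = mean(candidate.values()) - mean(baseline.values())
    return {
        "mean_delta": delta,
        "regressions": regressions,
        "passed": delta >= -tolerance,
    }


# Illustrative numbers only.
baseline = {"reset filevault key": 0.4, "enroll ipad": 0.6, "block os update": 0.8}
candidate = {"reset filevault key": 0.8, "enroll ipad": 0.6, "block os update": 0.6}
report = compare_runs(baseline, candidate)
```

Here the candidate improves on average, so the gate passes, but the report still flags one query that regressed. Averages alone would hide that.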

MLOps layer. Models were hosted on AWS SageMaker. A custom monitoring stack was built on CloudWatch, alongside a purpose-built evaluation job that ran nightly. The monitoring tracked three signals: retrieval precision, answer relevance, and user behavior (for example, whether people click away from AI-generated answers).

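from dataclasses import dataclass
@dataclass
class EvalResult:
    query: str
    precision: float
@dataclass
class EvaluationReport:
    results: list[EvalResult]

class RetrievalEvaluator:
    def __init__(self, retriever):
        self.retriever = retriever  # assumed: retrieve(query, top_k) returns document IDs

    def compute_precision(self, retrieved: list[str], expected: list[str]) -> float:
        # Fraction of retrieved document IDs that appear in the expected set.
        return len(set(retrieved) & set(expected)) / len(retrieved) if retrieved else 0.0

    def evaluate_batch(self, queries: list[str], expected_docs: list[list[str]]) -> EvaluationReport:
        results = []
        for query, expected in zip(queries, expected_docs):
            retrieved = self.retriever.retrieve(query, top_k=5)
            precision = self.compute_precision(retrieved, expected)
            results.append(EvalResult(query=query, precision=precision))
        return EvaluationReport(results=results)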

The Outcome

The LLM automation features shipped to production and are in active use. The RAG knowledge base improved support ticket deflection. The MLOps infrastructure reduced the time to detect and respond to model quality degradation from days to hours.

The retraining pipeline runs automatically when the monitoring system detects that answer relevance has dropped below threshold. This is the difference between an AI feature that starts good and stays good, versus one that starts good and slowly degrades unnoticed.
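A threshold trigger of this kind might look like the sketch below. The source only states that retraining fires when answer relevance drops below threshold. The rolling window, the RelevanceMonitor class, the threshold of 0.75, and the nightly scores are assumptions added for illustration; the window smooths out a single noisy night.

```python
from collections import deque
from statistics import mean


class RelevanceMonitor:
    """Tracks nightly answer-relevance scores and flags when retraining is due."""

    def __init__(self, threshold: float, window: int = 3):
        self.threshold = threshold
        # Only the most recent `window` scores are kept.
        self.scores: deque[float] = deque(maxlen=window)

    def record(self, score: float) -> bool:
        """Add one nightly score; return True if the rolling mean is below threshold."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        return window_full and mean(self.scores) < self.threshold


# Hypothetical threshold and scores.
monitor = RelevanceMonitor(threshold=0.75, window=3)
decisions = [monitor.record(s) for s in [0.82, 0.80, 0.74, 0.70, 0.69]]
```

In this sketch a True result is only a signal. The caller would be responsible for starting the actual retraining job.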

Lessons

RAG quality lives or dies on chunking strategy. Many RAG implementations split documents by character count. This is simple and wrong. Document structure matters: a chunk that begins mid-sentence because the previous chunk ended there loses much of its meaning.

Semantic chunking — splitting at logical boundaries (sections, paragraphs, procedure steps) — produces dramatically better retrieval. The improvement is not marginal. It is the kind of improvement that makes users stop noticing the retrieval layer at all.
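The contrast between the two strategies can be sketched in a few lines. This is a simplified illustration, not the pipeline used here: it assumes section headings are lines starting with "# " and paragraphs are separated by blank lines, and the function names, the max_chars default, and the sample text are all hypothetical.

```python
import re


def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Naive baseline: cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def semantic_chunks(text: str, max_chars: int = 300) -> list[str]:
    """Split at section headings, then pack whole paragraphs up to max_chars."""
    chunks: list[str] = []
    # Split before each heading line so a section never straddles two chunks.
    for section in re.split(r"\n(?=# )", text.strip()):
        paragraphs = [p.strip() for p in re.split(r"\n\s*\n", section) if p.strip()]
        current = ""
        for para in paragraphs:
            candidate = f"{current}\n\n{para}" if current else para
            if current and len(candidate) > max_chars:
                chunks.append(current)  # close the chunk at a paragraph boundary
                current = para
            else:
                current = candidate
        if current:
            chunks.append(current)
    # Note: a single paragraph longer than max_chars is kept whole, not cut.
    return chunks


# Made-up sample document.
doc = (
    "# Enable disk encryption\n\n"
    "Step 1. Create the encryption profile.\n\n"
    "Step 2. Assign the profile to the device group.\n\n"
    "# Troubleshooting\n\n"
    "If the recovery key is missing, re-run enrollment."
)
by_section = semantic_chunks(doc)
by_size = fixed_size_chunks(doc, 50)
```

With the default limit, by_section holds one chunk per section, each starting at its heading. The fixed-size version cuts wherever the character count lands, regardless of where a step or sentence begins.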

The second lesson: build the retrieval evaluation harness before you write a single embedding. The companies that ship good RAG are the companies that measure retrieval quality rigorously from day one — not as an afterthought when users start complaining.
