Enterprise DevOps

Kandji

Designed and shipped LLM-powered automation features and RAG pipelines for a leading enterprise device-management platform.

Role

Staff AI Engineer

Timeline

2024–2025

Stack

Python, RAG, Vector DBs, MLOps, AWS

The Brief

Kandji is an enterprise Apple device management platform: the system IT administrators use to deploy, configure, and secure MacBooks and iPhones at scale. It is a deeply technical product used by companies that take endpoint security seriously.

The engagement was to bring LLM capabilities into the platform: automation features that understand natural language, a RAG system for the knowledge base, and the MLOps infrastructure to keep everything running in production.

The Approach

Enterprise AI features live in a different constraint space than consumer AI features. The users are IT professionals — technically sophisticated, time-poor, and intolerant of hallucinations. They are not forgiving of AI that sounds confident but gives wrong answers.

This shaped every technical decision. RAG retrieval quality was the obsession, not generation quality. A language model generating fluent nonsense is worse than no AI feature at all in an enterprise tool. The retrieval layer needed to be precise.

"In enterprise AI, a confident wrong answer is worse than no answer."

The vector database selection (Pinecone, evaluated against pgvector and Weaviate) came down to query performance at scale and the quality of metadata filtering. Kandji's knowledge base has fine-grained structure — device types, OS versions, deployment scenarios — and the retrieval system needed to respect that structure, not treat all documents as equivalent.
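To make the metadata-filtering point concrete, here is a minimal in-memory sketch. It is not Pinecone's API and not the production code: the `Doc` type, the `device_type` and `os_version` fields, and the toy two-dimensional vectors are all hypothetical. It only illustrates the idea of narrowing candidates by metadata before ranking by similarity.

```python
from dataclasses import dataclass
import math


@dataclass
class Doc:
    doc_id: str
    vector: list[float]
    metadata: dict[str, str]  # hypothetical fields, e.g. device_type, os_version


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(docs: list[Doc], query_vec: list[float],
                    filters: dict[str, str], top_k: int = 5) -> list[str]:
    # Keep only documents whose metadata matches every filter, then rank by similarity.
    candidates = [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(candidates, key=lambda d: cosine(d.vector, query_vec), reverse=True)
    return [d.doc_id for d in ranked[:top_k]]


# Toy corpus (made-up IDs and vectors).
docs = [
    Doc("mac-filevault", [1.0, 0.0], {"device_type": "mac", "os_version": "14"}),
    Doc("ios-passcode", [0.9, 0.1], {"device_type": "iphone", "os_version": "17"}),
    Doc("mac-firewall", [0.2, 0.8], {"device_type": "mac", "os_version": "14"}),
]
hits = filtered_search(docs, [1.0, 0.0], {"device_type": "mac"})
```

Without the filter, the iPhone document would rank second purely on vector similarity, which is exactly the kind of wrong-context result a Mac administrator should never see.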

The Build

The automation features were built using a tool-calling architecture. The LLM orchestrates a set of well-defined tools (query device status, apply configuration profile, create enforcement rule) rather than generating free-form output. This is more reliable and more auditable than free-form generation — you can log exactly what the model decided to do and why.
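The tool-calling pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the shipped implementation: the tool bodies are stubs, the JSON shape of the model's tool call (`name`, `arguments`) is assumed, and `dispatch` and `audit_log` are hypothetical names.

```python
import json
from typing import Callable


# Hypothetical stub tools; real ones would call the platform's own services.
def query_device_status(device_id: str) -> dict:
    return {"device_id": device_id, "status": "compliant"}


def apply_configuration_profile(device_id: str, profile: str) -> dict:
    return {"device_id": device_id, "applied": profile}


# The model may only choose from this fixed registry.
TOOLS: dict[str, Callable[..., dict]] = {
    "query_device_status": query_device_status,
    "apply_configuration_profile": apply_configuration_profile,
}

audit_log: list[dict] = []


def dispatch(tool_call_json: str) -> dict:
    """Validate and run one model-proposed tool call, recording it for audit."""
    call = json.loads(tool_call_json)
    name, args = call["name"], call.get("arguments", {})
    if name not in TOOLS:
        audit_log.append({"tool": name, "arguments": args, "outcome": "rejected"})
        raise ValueError(f"unknown tool: {name}")
    result = TOOLS[name](**args)
    audit_log.append({"tool": name, "arguments": args, "outcome": "ok"})
    return result


result = dispatch('{"name": "query_device_status", "arguments": {"device_id": "MAC-001"}}')
```

Because every action passes through one dispatcher, the audit trail is a by-product of the architecture rather than something bolted on afterwards.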

The RAG pipeline was built with a retrieval evaluation harness from the start. Before generating the first embedding, I built a test set of question/answer pairs representative of real user queries, with manually annotated correct answers. Every change to the retrieval pipeline was measured against this benchmark. This sounds obvious; it is not common practice.

The MLOps layer used AWS SageMaker for model hosting, with a custom monitoring stack built on CloudWatch and a purpose-built evaluation job that ran nightly. The monitoring tracked three things: retrieval precision (are we getting the right documents?), answer relevance (are the answers actually useful?), and user behavior signals (are people clicking away from AI-generated answers?).

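from dataclasses import dataclass

@dataclass
class EvalResult:
    query: str
    precision: float

@dataclass
class EvaluationReport:
    results: list[EvalResult]

class RetrievalEvaluator:
    # Excerpt: self.retriever and self.compute_precision are defined elsewhere.
    def evaluate_batch(
        self,
        queries: list[str],
        expected_docs: list[list[str]]
    ) -> EvaluationReport:
        results = []
        for query, expected in zip(queries, expected_docs, strict=True):  # lengths must match
            retrieved = self.retriever.retrieve(query, top_k=5)
            precision = self.compute_precision(retrieved, expected)
            results.append(EvalResult(query=query, precision=precision))
        return EvaluationReport(results=results)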
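
The evaluator calls `compute_precision`, which is not shown. One plausible definition is precision at k, sketched below under the assumption that the retriever returns a ranked list of document IDs and that the annotations are lists of relevant IDs. The function name `precision_at_k` is illustrative, and the actual metric used may have differed.

```python
def precision_at_k(retrieved: list[str], expected: list[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved document IDs that are annotated as relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0  # nothing retrieved counts as zero precision
    relevant = set(expected)
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)


# Two of the four retrieved documents appear in the annotated set.
score = precision_at_k(["a", "b", "c", "d"], ["a", "c", "x"])
```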

The Outcome

The LLM automation features shipped to production and are in active use. The RAG knowledge base improved support ticket deflection. The MLOps infrastructure reduced the time to detect and respond to model quality degradation from days to hours.

The retraining pipeline runs automatically when the monitoring system detects that answer relevance has dropped below a set threshold. This is the difference between an AI feature that starts good and stays good, and one that starts good and slowly degrades.
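
A threshold trigger of this kind might look like the sketch below. The specifics are assumptions made for illustration: the 0.8 threshold, the three-night window, and the use of a rolling mean (to avoid reacting to a single noisy night) are not taken from the production system.

```python
from statistics import mean


def should_retrain(nightly_scores: list[float],
                   threshold: float = 0.8, window: int = 3) -> bool:
    """Return True when the mean of the last `window` scores falls below `threshold`."""
    if len(nightly_scores) < window:
        return False  # not enough history to judge
    return mean(nightly_scores[-window:]) < threshold


# Hypothetical nightly answer-relevance scores, oldest first.
trigger = should_retrain([0.91, 0.84, 0.78, 0.74])
```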

Lessons

RAG quality lives or dies on chunking strategy. Many RAG implementations split documents by character count. This is simple and wrong. Document structure matters: a chunk that begins mid-sentence because the previous chunk ended there loses its meaning entirely.

Semantic chunking — splitting at logical boundaries (sections, paragraphs, procedure steps) — produces dramatically better retrieval. The improvement is not marginal. It is the kind of improvement that makes users stop noticing the retrieval layer.
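
As a rough illustration of boundary-aware chunking, the sketch below packs whole paragraphs into chunks instead of cutting at a fixed character offset. It assumes paragraphs are separated by blank lines, which is a simplification: a real pipeline would also respect section headings and procedure steps. The function name, the size limit, and the sample text are made up.

```python
def chunk_by_paragraph(text: str, max_chars: int = 500) -> list[str]:
    """Pack whole paragraphs into chunks, never splitting inside a paragraph.

    Chunks stay within max_chars unless a single paragraph is longer than
    the limit, in which case that paragraph becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)  # flush at a paragraph boundary
            current = para
    if current:
        chunks.append(current)
    return chunks


text = (
    "Step 1: Open the profile.\n\n"
    "Step 2: Assign it to a blueprint.\n\n"
    "Step 3: Verify on the device."
)
chunks = chunk_by_paragraph(text, max_chars=70)
```

Every chunk here starts at the beginning of a step, so none of them opens mid-sentence the way a fixed-width split could.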

The second lesson: retrieval evaluation is a product investment, not a technical nicety. The companies that ship good RAG are the companies that measure retrieval quality rigorously from day one.
