The Brief

Volvo Group's fleet management division operates at a scale where the difference between reactive and predictive maintenance is a measurable financial event. A truck taken offline unexpectedly costs more than a truck serviced proactively — not just in parts and labor, but in logistics disruption, delivery delays, and customer penalties.

The engagement was on the ML and IoT data engineering side: building the systems that ingest sensor data from connected vehicles, detect anomalies, and predict maintenance needs before failures occur.

What Was Delivered

Streaming telemetry pipeline on Apache Kafka ingesting sensor data across the vehicle fleet
Feature engineering running in the stream processing layer before model inference — rolling statistics, frequency-domain features from vibration sensors, cross-signal correlations
Anomaly detection models (gradient-boosted trees trained on historical failure data) converted to ONNX format for edge deployment
Over-the-air model update mechanism with rolling deployment and automatic rollback on performance regression
15% reduction in fleet downtime measured by Volvo's data team after production deployment

The Approach

Predictive maintenance for heavy vehicles is not a novel problem domain. What made this engagement technically interesting was the constraint: inference needed to happen at the edge — on the vehicle, or in roadside compute infrastructure — not in a central cloud.

The fleet operates in areas with intermittent connectivity. A maintenance alert that requires a cloud round-trip to generate is less useful than an alert generated locally and immediately. Time-critical warnings (imminent failure) need to fire regardless of network state.

This pushed the model architecture toward efficiency. The research models used in development were large; the production models needed to fit in embedded compute with severe memory and power constraints.

"The best model for production is not the most accurate model. It is the most accurate model that fits your constraints."

The Build

Streaming pipeline. Each vehicle generates telemetry continuously: engine temperature, oil pressure, transmission torque, brake pad sensors, dozens of other signals. The volume across the fleet is substantial.

Apache Kafka handled streaming ingestion. Feature engineering — rolling statistics, frequency-domain features derived from vibration sensors, cross-signal correlations — ran in the stream processing layer before the data reached model inference. Precomputing features in the stream rather than at query time was essential for hitting the latency requirements.
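The three feature families named above can be sketched with the Python standard library alone. This is a minimal illustration, not the production stream-processing code: the names `RollingFeatures`, `dominant_frequency_bin`, and `pearson`, as well as the window sizes, are assumptions made for the example, and a real pipeline would use an FFT rather than the naive DFT shown here.

```python
import cmath
import math
from collections import deque
from statistics import fmean, pstdev


class RollingFeatures:
    """Fixed-size window over one signal (hypothetical helper, not project code)."""

    def __init__(self, window: int) -> None:
        # A bounded deque drops the oldest sample once the window is full.
        self.values = deque(maxlen=window)

    def update(self, value: float) -> dict:
        self.values.append(value)
        return {"mean": fmean(self.values), "std": pstdev(self.values)}


def dominant_frequency_bin(samples: list) -> int:
    """Index of the strongest non-DC bin of a naive O(n^2) DFT."""
    n = len(samples)
    magnitudes = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples)
        )
        magnitudes.append(abs(coeff))
    # Bins start at k=1, so shift the list index by one.
    return 1 + magnitudes.index(max(magnitudes))


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation between two equally long, non-constant signal windows."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Each function consumes a bounded window, so the per-message cost stays constant and the features are ready by the time a record reaches inference.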

Anomaly detection models. Gradient-boosted trees trained on historical failure data. Class imbalance is severe in maintenance prediction (most of the time, nothing is wrong) and was handled with a careful sampling strategy. Models were converted to ONNX format for deployment to edge devices, because running ONNX Runtime on embedded hardware is significantly more efficient than running a full training framework.
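The text does not say which sampling strategy was used. As one common option, random undersampling of the majority class could look like the sketch below; the function name `undersample_majority`, the `ratio` parameter, and the 0/1 label encoding are all assumptions for illustration.

```python
import random


def undersample_majority(rows: list, labels: list, ratio: float = 5.0, seed: int = 0):
    """Keep every failure row and at most `ratio` healthy rows per failure row.

    Assumes labels are 1 for failure and 0 for healthy (hypothetical encoding).
    """
    rng = random.Random(seed)  # seeded so the training set is reproducible
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    keep_negatives = min(len(negatives), int(len(positives) * ratio))
    kept = positives + rng.sample(negatives, keep_negatives)
    kept.sort()  # preserve the original (time) order of the rows
    return [rows[i] for i in kept], [labels[i] for i in kept]
```

Undersampling only belongs in the training split; evaluation data should keep the real class ratio so that reported alert precision reflects what operators will actually see.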

Edge deployment. A custom update mechanism: models were versioned, tested in simulation, and deployed to vehicles over-the-air during scheduled connectivity windows. Rolling deployment with automatic rollback on performance degradation was built from the start — not added later.

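from random import sample


class EdgeModelUpdater:
    # DeploymentResult, baseline_accuracy, _deploy_to_fleet and _rollback
    # are assumed to be defined elsewhere in the codebase.
    def deploy_update(
        self,
        model_version: str,
        vehicle_ids: list[str],
        rollout_pct: float = 0.1,
    ) -> DeploymentResult:
        # Pilot on a random subset; never sample fewer than one vehicle.
        pilot_size = max(1, int(len(vehicle_ids) * rollout_pct))
        pilot_fleet = sample(vehicle_ids, min(pilot_size, len(vehicle_ids)))
        result = self._deploy_to_fleet(model_version, pilot_fleet)
        # Roll back if pilot accuracy drops more than 2% below the baseline.
        if result.p95_accuracy < self.baseline_accuracy * 0.98:
            self._rollback(pilot_fleet)
            return DeploymentResult(status="rolled_back", reason="accuracy_regression")
        return result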

The Outcome

The 15% reduction in fleet downtime came from Volvo's data team analysis after the systems were in production. The mechanism was clear: predictive alerts gave logistics operators enough lead time to schedule proactive maintenance during planned downtime windows rather than responding to unexpected failures mid-route.

The edge inference system handled the connectivity constraint as designed. Alerts generated in low-connectivity zones were queued locally and synchronized when connectivity was restored. Time-critical alerts triggered local operator notifications immediately, without requiring a cloud round-trip.
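The queue-and-synchronize behaviour described above is a store-and-forward pattern. The sketch below is a hypothetical illustration, not the deployed code: `Alert`, `AlertBuffer`, and the two callbacks are names invented for the example, and a real device would persist the queue to disk so that it survives a power cycle.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class Alert:
    vehicle_id: str
    message: str
    critical: bool = False


class AlertBuffer:
    """In-memory store-and-forward buffer (hypothetical names throughout)."""

    def __init__(
        self,
        notify_operator: Callable[[Alert], None],
        upload: Callable[[Alert], bool],
    ) -> None:
        self._notify_operator = notify_operator
        self._upload = upload  # returns True only when the cloud accepted the alert
        self._pending = deque()

    def raise_alert(self, alert: Alert) -> None:
        # Critical alerts reach the local operator first, independent of the network.
        if alert.critical:
            self._notify_operator(alert)
        self._pending.append(alert)

    def sync(self) -> int:
        """Upload queued alerts in order; stop at the first failure. Returns count sent."""
        sent = 0
        while self._pending:
            if not self._upload(self._pending[0]):
                break  # still offline: keep the alert for the next attempt
            self._pending.popleft()
            sent += 1
        return sent
```

An alert is removed from the queue only after the upload callback confirms it, so a connection that drops mid-sync cannot lose alerts, though it may cause a duplicate delivery that the receiving side has to tolerate.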

Lessons

IoT data is messier than the sensor datasheets suggest. Calibration drift, sensor failures, communication errors, and vehicles operating outside their designed envelope all produce data that will confuse models trained on clean data. The data cleaning and anomaly filtering layer is not a preprocessing detail — it is half the engineering work.
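A first-pass filter for some of these failure modes might look like the sketch below. It is an assumption-laden illustration rather than the project's cleaning layer: the function name `clean_readings`, the plausibility bounds, and the `max_repeats` threshold are hypothetical, and the sketch covers missing values, out-of-range values, and stuck sensors but not calibration drift.

```python
import math


def clean_readings(readings: list, low: float, high: float, max_repeats: int = 5):
    """Return (kept, rejected), where rejected maps a reason to a count.

    Drops None/NaN values (communication errors), values outside [low, high]
    (physically implausible), and any value repeated more than `max_repeats`
    times in a row (a possible stuck sensor).
    """
    kept = []
    rejected = {"missing": 0, "out_of_range": 0, "stuck": 0}
    last, run = None, 0
    for value in readings:
        if value is None or math.isnan(value):
            rejected["missing"] += 1
            continue
        if not low <= value <= high:
            rejected["out_of_range"] += 1
            continue
        # Track how many times the same in-range value has repeated.
        run = run + 1 if value == last else 1
        last = value
        if run > max_repeats:
            rejected["stuck"] += 1
            continue
        kept.append(value)
    return kept, rejected
```

Returning the rejection counts alongside the cleaned data matters in practice: a rising "stuck" or "missing" count on one vehicle is itself a maintenance signal.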

Edge deployment requires treating the device fleet as a distributed system, not as N independent devices. Rollout strategy, version tracking, and automated rollback are not optional features; they are the difference between a controlled update and a fleet-wide incident.

Kubernetes for training infrastructure, edge compute for inference — the split is correct. Do not run training jobs on edge hardware.
