The Brief
Volvo Group's fleet management division operates at a scale that makes the difference between reactive and predictive maintenance a measurable financial event. A truck taken offline unexpectedly costs more than a truck serviced proactively — not just in parts and labor, but in logistics disruption, delivery delays, and customer penalties.
The engagement was on the ML and IoT data engineering side: building the systems that ingest sensor data from connected vehicles, detect anomalies, and predict maintenance needs before failures occur.
The Approach
Predictive maintenance for heavy vehicles is not a novel problem domain. What made this engagement technically interesting was the constraint: inference needed to happen at the edge — on the vehicle, or in roadside compute infrastructure — not in a central cloud. The fleet operates in areas with intermittent connectivity. A maintenance alert that requires a cloud round-trip to generate is less useful than an alert generated locally.
This pushed the model architecture toward efficiency. The research models used in development were large; the production models needed to fit in embedded compute with severe memory and power constraints.
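As a minimal sketch of that trade-off, model selection can be framed as "maximize accuracy subject to hardware budgets." The `Candidate` fields, the budget parameters, and every number below are hypothetical illustrations, not measurements from the engagement.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float    # validation accuracy, 0..1
    size_mb: float     # serialized model size
    latency_ms: float  # inference latency measured on the target hardware


def pick_production_model(
    candidates: list[Candidate],
    max_size_mb: float,
    max_latency_ms: float,
) -> Optional[Candidate]:
    """Return the most accurate candidate that fits both budgets, or None."""
    feasible = [
        c for c in candidates
        if c.size_mb <= max_size_mb and c.latency_ms <= max_latency_ms
    ]
    return max(feasible, key=lambda c: c.accuracy, default=None)


# Illustrative numbers only.
candidates = [
    Candidate("research_large", accuracy=0.95, size_mb=900.0, latency_ms=400.0),
    Candidate("gbt_medium", accuracy=0.92, size_mb=40.0, latency_ms=15.0),
    Candidate("gbt_small", accuracy=0.89, size_mb=8.0, latency_ms=4.0),
]
chosen = pick_production_model(candidates, max_size_mb=64.0, max_latency_ms=20.0)
```

With these made-up budgets the largest model is filtered out before accuracy is even compared, which is the point: the constraints define the candidate set, and accuracy only ranks what survives.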
"The best model for production is not the most accurate model. It is the most accurate model that fits your constraints."
The Build
The data pipeline was built on a streaming architecture. Each vehicle generates telemetry continuously: engine temperature, oil pressure, transmission torque, brake pad sensors, dozens of other signals. The volume across the fleet is substantial.
Apache Kafka handled the streaming ingestion. Feature engineering — rolling statistics, frequency-domain features derived from vibration sensors, cross-signal correlations — ran in the stream processing layer before the data reached model inference. Precomputing features in the stream rather than at query time was essential for hitting the latency requirements.
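To illustrate the rolling-statistics part of that feature layer, here is a minimal sketch of a sliding-window accumulator that updates in constant time per reading. The `RollingStats` class is a hypothetical stand-in, not the engagement's stream processor code, and it covers only mean and standard deviation, not the frequency-domain or cross-signal features.

```python
from collections import deque
from math import sqrt


class RollingStats:
    """Fixed-size sliding window with O(1) mean and variance updates."""

    def __init__(self, window: int) -> None:
        if window < 1:
            raise ValueError("window must be >= 1")
        self._values = deque()
        self._window = window
        self._sum = 0.0
        self._sum_sq = 0.0

    def update(self, value: float) -> None:
        """Add a reading and evict the oldest one once the window is full."""
        self._values.append(value)
        self._sum += value
        self._sum_sq += value * value
        if len(self._values) > self._window:
            old = self._values.popleft()
            self._sum -= old
            self._sum_sq -= old * old

    @property
    def mean(self) -> float:
        if not self._values:
            raise ValueError("no readings yet")
        return self._sum / len(self._values)

    @property
    def std(self) -> float:
        """Population standard deviation over the current window."""
        n = len(self._values)
        if n == 0:
            raise ValueError("no readings yet")
        # Clamp at zero: floating-point error can make the difference negative.
        variance = max(0.0, self._sum_sq / n - self.mean ** 2)
        return sqrt(variance)


# Example: a 3-reading window over a hypothetical oil-pressure signal.
stats = RollingStats(window=3)
for reading in [1.0, 2.0, 3.0, 4.0]:
    stats.update(reading)
```

Keeping running sums avoids rescanning the window on every message. The sum-of-squares form can lose precision on long-running streams with large values, so a production version might periodically recompute from the buffered window.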
The anomaly detection models were gradient-boosted trees trained on historical failure data, with class imbalance handled via a careful sampling strategy. (Class imbalance is severe in maintenance prediction: most of the time, nothing is wrong.) The models were converted to ONNX format for deployment to edge devices.
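The source does not say which sampling strategy was used. As one common option, assumed here purely for illustration, random undersampling keeps every failure example and caps the number of healthy examples per failure. The `failed` label key and the 5:1 ratio are hypothetical.

```python
import random


def undersample_majority(
    rows: list[dict],
    label_key: str = "failed",
    ratio: float = 5.0,
    seed: int = 0,
) -> list[dict]:
    """Keep all positive rows and at most `ratio` negative rows per positive."""
    positives = [r for r in rows if r[label_key]]
    negatives = [r for r in rows if not r[label_key]]
    rng = random.Random(seed)  # seeded so the training set is reproducible
    keep = min(len(negatives), int(len(positives) * ratio))
    balanced = positives + rng.sample(negatives, keep)
    rng.shuffle(balanced)
    return balanced


# Example: 2 failures among 102 telemetry windows.
rows = [{"failed": True}] * 2 + [{"failed": False}] * 100
training_rows = undersample_majority(rows, ratio=5.0)
```

Resampling like this should be applied to the training split only, so that validation metrics still reflect the real failure rate.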
The edge deployment infrastructure used a custom update mechanism — models were versioned, tested in simulation, and deployed to vehicles over the air during scheduled connectivity windows. Rolling deployment with automatic rollback on performance degradation was built from the start.
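    from random import sample

    # DeploymentResult, _deploy_to_fleet, _rollback, and baseline_accuracy
    # are defined elsewhere in the deployment infrastructure.
    class EdgeModelUpdater:
        def deploy_update(
            self,
            model_version: str,
            vehicle_ids: list[str],
            rollout_pct: float = 0.1,
        ) -> DeploymentResult:
            # Pilot on a random slice of the fleet (at least one vehicle when any exist).
            pilot_size = min(len(vehicle_ids), max(1, int(len(vehicle_ids) * rollout_pct)))
            pilot_fleet = sample(vehicle_ids, pilot_size)
            result = self._deploy_to_fleet(model_version, pilot_fleet)
            # Roll back if pilot accuracy falls more than 2% below the baseline.
            if result.p95_accuracy < self.baseline_accuracy * 0.98:
                self._rollback(pilot_fleet)
                return DeploymentResult(status="rolled_back", reason="accuracy_regression")
            return result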
The Outcome
The 15% reduction in fleet downtime comes from the Volvo data team's own analysis, carried out after the systems were in production. The mechanism was clear: predictive alerts gave logistics operators enough lead time to schedule proactive maintenance during planned downtime windows rather than responding to unexpected failures.
The edge inference system handled the connectivity constraint as designed. Alerts generated in low-connectivity zones were queued locally and synchronized when connectivity was restored, while time-critical alerts (imminent failure detection) triggered local operator notifications.
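The queue-and-sync behaviour described above can be sketched as a small store-and-forward buffer. This is an illustration under assumptions: `Alert`, `AlertBuffer`, and the `notify_local` and `upload` callbacks are hypothetical names, and a real device would persist the queue to disk rather than hold it in memory.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Alert:
    vehicle_id: str
    code: str
    critical: bool


class AlertBuffer:
    """Queues alerts locally and uploads them in order once a link is available."""

    def __init__(
        self,
        notify_local: Callable[[Alert], None],
        upload: Callable[[Alert], bool],
    ) -> None:
        self._pending = deque()
        self._notify_local = notify_local  # e.g. an in-cab warning (assumed)
        self._upload = upload              # returns False while offline

    def raise_alert(self, alert: Alert) -> None:
        # Time-critical alerts reach the operator immediately, with no network.
        if alert.critical:
            self._notify_local(alert)
        # Every alert is queued for later synchronization.
        self._pending.append(alert)

    def sync(self) -> int:
        """Upload queued alerts oldest first; stop at the first failure."""
        sent = 0
        while self._pending:
            if not self._upload(self._pending[0]):
                break
            self._pending.popleft()
            sent += 1
        return sent

    @property
    def pending(self) -> int:
        return len(self._pending)
```

An alert is removed from the queue only after its upload succeeds, so a connection that drops mid-sync leaves the remaining alerts intact for the next window.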
Lessons
IoT data is messier than the sensor datasheets suggest. Calibration drift, sensor failures, communication errors, and vehicles operating outside their designed envelope all produce data that will confuse models trained on clean data. The data cleaning and anomaly filtering layer is not a preprocessing detail — it is half the engineering work.
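As a minimal sketch of what such a filtering layer might check, the class below rejects missing values, physically implausible values, and readings from a sensor that appears stuck on one value. `SensorFilter`, its thresholds, and the temperature range in the example are hypothetical; real limits would come from sensor datasheets and fleet history.

```python
import math
from typing import Optional


class SensorFilter:
    """Rejects readings that are missing, out of range, or stuck at one value."""

    def __init__(self, low: float, high: float, max_repeats: int = 50) -> None:
        self._low = low
        self._high = high
        self._max_repeats = max_repeats
        self._last: Optional[float] = None
        self._repeats = 0  # identical readings seen since the value last changed

    def accept(self, value: Optional[float]) -> bool:
        # Dropped messages and NaNs from communication errors.
        if value is None or math.isnan(value):
            return False
        # Values outside the plausible physical range.
        if not (self._low <= value <= self._high):
            return False
        # A value repeating too many times in a row suggests a stuck sensor.
        if value == self._last:
            self._repeats += 1
        else:
            self._last = value
            self._repeats = 0
        return self._repeats < self._max_repeats


# Example: a hypothetical engine-temperature channel in degrees Celsius.
engine_temp = SensorFilter(low=-40.0, high=150.0, max_repeats=3)
readings = [90.0, float("nan"), 400.0, 90.0, 90.0, 90.0, 91.0]
accepted = [r for r in readings if engine_temp.accept(r)]
```

Rejected readings are usually worth counting per vehicle as well as dropping, since a rising rejection rate is itself a signal that a sensor needs attention.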
Edge deployment requires treating the device fleet as a distributed system, not as N independent devices. Rollout strategy, version tracking, and automated rollback are not optional features; they are the difference between a controlled update and a fleet-wide incident.
Kubernetes for training infrastructure, edge compute for inference — the split is correct. Do not run training jobs on edge hardware.