Platform
The operations layer.
Five capabilities a fleet needs after the first device ships — and what it takes to keep that fleet alive in the field. Built into the platform; not a bolt-on, not a vendor add-on, not the customer's problem.
Pillar 1
Observability
Live per micro service metrics, logs, traces, flame-graph profiles, and ready-made dashboards — Prometheus and OpenTelemetry compatible out of the box.
Automatic per micro service metrics
RPC timing, request and error counters, throughput, lifecycle events, status, uptime — no instrumentation code required.
Custom metric push + health score
Any micro service pushes custom metrics and a 0.0–1.0 health score via the in-process observer API.
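As a rough illustration of what a pushed health score could look like once it reaches a Prometheus-compatible pipeline, the sketch below clamps a score to the 0.0 to 1.0 range and renders it in the Prometheus text exposition format. The metric name `service_health_score`, the `service` label, and both function names are assumptions made for this example; they are not the platform's observer API.

```python
def clamp_health(score: float) -> float:
    """Clamp a raw score into the 0.0 to 1.0 range."""
    return max(0.0, min(1.0, score))


def render_gauge(name: str, service: str, value: float, help_text: str) -> str:
    """Render one gauge sample in the Prometheus text exposition format."""
    # Label-value escaping (backslash, quote, newline) is omitted for brevity.
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f'{name}{{service="{service}"}} {value}\n'
    )


# Hypothetical metric and service names, for illustration only.
sample = render_gauge(
    "service_health_score", "gps", clamp_health(1.3), "Health score from 0.0 to 1.0"
)
print(sample, end="")
```

Clamping on the producer side keeps an out-of-range value from ever reaching a dashboard or alert rule.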
Cloud-side periodic snapshot
A dedicated reporter micro service periodically sends per-component and per-process records (CPU, memory, open files, threads, uptime, health status) to the cloud, and forces a flush at idle, stop, and emergency stop.
Container Prometheus scrape
Every running container's metrics endpoint feeds the supervision pipeline via the lightweight container engine.
Distributed tracing (W3C Trace Context)
Request traces flow across micro services and processes — no application code change required.
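For readers unfamiliar with W3C Trace Context, the sketch below shows the shape of the `traceparent` header that carries a trace from one hop to the next: a version, a 32-hex-digit trace id, a 16-hex-digit parent id, and 2-hex-digit flags. The function names are made up for this example, and it handles version `00` only; how the platform propagates the header internally is not shown here.

```python
import re
import secrets

# traceparent, version 00: "00-<trace-id: 32 hex>-<parent-id: 16 hex>-<flags: 2 hex>"
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header: str):
    """Return (trace_id, parent_id, flags), or None if the header is invalid."""
    m = _TRACEPARENT.fullmatch(header)
    if m is None:
        return None
    trace_id, parent_id, flags = m.groups()
    # All-zero ids are invalid per the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, flags


def child_traceparent(header: str) -> str:
    """Keep the trace id and mint a fresh span id for the next hop."""
    parsed = parse_traceparent(header)
    if parsed is None:
        raise ValueError("invalid traceparent")
    trace_id, _, flags = parsed
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
```

Because the trace id is preserved and only the span id changes, a trace back end can stitch the hops into one request tree.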
Force-sync remote pull
Cloud or tooling pulls an immediate snapshot on demand, rate-limited.
Centralized log aggregation
Device logs stream to a central store — queryable per device and fleet-wide, alongside the same metrics, with no separate logging agent to deploy.
On-device flame-graph profiling
Sample stacks on a running device and read them back as flame graphs to see where time goes — in the field, without a special build.
Ready-made dashboards
Per-device and fleet-wide dashboards are generated from the service catalog, so a device is observable the moment it boots — no dashboard authoring.
Why it matters. A fleet device with a hundred independent micro services emits a coherent Prometheus-compatible metric set within minutes of boot — with logs, flame-graph profiles, and ready-made dashboards built in. Use the dashboards that ship with the platform, or the ones the cloud team already owns.
How the registry and observer fit in the platform architecture →
Pillar 2
Safety
Four supervision tiers and a deterministic escalation chain. A hung task, a hung process, or a hung kernel cannot lose the device.
Watchdog supervision flow. A micro service task pets its in-process HealthMonitor checkpoint; process pings flow into the watchdog policy micro service alongside the HealthMonitor; the policy emits warn, restart, or emergency stop to the registry, and pets the Linux hardware watchdog; on stall the hardware forces a system reboot.
graph TD
A[Micro service task] -->|pet checkpoint| B[In-process HealthMonitor]
C[Process pings] --> D[Watchdog policy micro service]
B --> D
D -->|warn / restart / emergency stop| E[Registry]
D -->|pet| F[Linux /dev/watchdog]
F -->|hardware reset on stall| G[System reboot]
Tier 1 — In-process HealthMonitor
Each micro service declares named checkpoints and pets them from its task loop.
Tier 2 — Process pings
Primary and worker processes exchange health pings; degraded state surfaces to the policy.
Tier 3 — Watchdog policy micro service
30-second autonomous loop, per-component violation counter, deterministic action ladder warn → restart → emergency stop.
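One way to picture a per-component violation counter driving a deterministic ladder is sketched below. It assumes one rung per consecutive violation and a reset on the first healthy report; the actual thresholds and reset rules are not stated on this page, and the class and method names are hypothetical.

```python
from collections import defaultdict

# The ladder named above, in escalation order.
LADDER = ["warn", "restart", "emergency_stop"]


class EscalationPolicy:
    """Hypothetical policy: each consecutive violation climbs one rung."""

    def __init__(self):
        self.violations = defaultdict(int)

    def evaluate(self, component: str, healthy: bool):
        """Called once per loop iteration for each supervised component."""
        if healthy:
            # Assumption: a healthy report clears the counter.
            self.violations.pop(component, None)
            return None
        self.violations[component] += 1
        rung = min(self.violations[component], len(LADDER)) - 1
        return LADDER[rung]


policy = EscalationPolicy()
actions = [policy.evaluate("modem", healthy=False) for _ in range(4)]
```

Since the action is a pure function of the counter, the same sequence of health reports always produces the same sequence of actions, which is what makes such a chain testable on the bench.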
Tier 4 — Linux hardware watchdog
Timer petted at a fixed interval; if the software stack hangs, the hardware forces a reset.
Emergency-stop lifecycle hook
Every micro service gets a short cleanup window before reboot. Hooks called from the registry, not the application.
Issue aggregation
Identical issues collapse into deduplicated batches — periodic or burst-classified — before upload, so the cloud is never flooded with duplicates.
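A minimal sketch of the deduplication idea follows. It assumes an issue is identified by a `(component, code)` pair and that "burst" means a count at or above a threshold within one upload window; both are illustrative guesses, as are the function name and the threshold value.

```python
from collections import Counter


def aggregate(issues, burst_threshold=3):
    """Collapse identical issues from one window into one record each."""
    counts = Counter(issues)  # issues are (component, code) tuples
    return [
        {
            "component": component,
            "code": code,
            "count": n,
            # Hypothetical rule: many repeats in one window count as a burst.
            "kind": "burst" if n >= burst_threshold else "periodic",
        }
        for (component, code), n in counts.items()
    ]


window = [("modem", "E_TIMEOUT")] * 4 + [("gps", "E_NOFIX")]
batches = aggregate(window)
```

Five raw issues become two uploaded records, with the repeat count preserved so no information is lost.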
Energy referee
Any micro service can hold a token to defer a power transition until safe; the engine waits up to a configurable timeout for tokens to clear.
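The token mechanism can be pictured with the small sketch below: holders register a token, and the power engine polls until the set is empty or a timeout expires. The class name, method names, and polling approach are assumptions for illustration; what the engine does when the timeout expires is left to the caller here.

```python
import time


class EnergyReferee:
    """Hypothetical referee tracking who is deferring a power transition."""

    def __init__(self):
        self._tokens = set()

    def hold(self, owner: str) -> None:
        self._tokens.add(owner)

    def release(self, owner: str) -> None:
        self._tokens.discard(owner)

    def wait_until_clear(self, timeout_s: float, poll_s: float = 0.01) -> bool:
        """Return True if all tokens cleared before the timeout, else False."""
        deadline = time.monotonic() + timeout_s
        while self._tokens and time.monotonic() < deadline:
            time.sleep(poll_s)
        return not self._tokens


referee = EnergyReferee()
referee.hold("flash-writer")
timed_out_clear = referee.wait_until_clear(timeout_s=0.05)
referee.release("flash-writer")
released_clear = referee.wait_until_clear(timeout_s=0.05)
```

Using a monotonic clock for the deadline keeps the wait immune to wall-clock adjustments.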
Why it matters. The escalation chain is auditable, testable on the bench, and runs without operator intervention. A field outage on a fleet at scale costs more than the platform itself.
Pillar 3
Over-the-air updates
A/B partitions, signed delta packages, automatic rollback. Pure Rust, no external daemon, runs on modem-class hardware.
A/B rootfs slots + bootcount
A failed boot reverts to the previous slot without operator action.
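The sketch below shows one common way an A/B scheme with a boot counter can work, loosely following the bootcount pattern used by several bootloaders. The field names, the limit of three attempts, and the two functions are assumptions for this example rather than the platform's actual boot logic.

```python
from dataclasses import dataclass


@dataclass
class BootState:
    active: str = "A"             # slot the bootloader will try next
    previous: str = "B"           # last known-good slot
    bootcount: int = 0            # attempts since the update was installed
    bootlimit: int = 3            # hypothetical attempt limit
    upgrade_pending: bool = False


def select_slot(state: BootState) -> str:
    """Runs at every boot, before the OS starts."""
    if state.upgrade_pending:
        state.bootcount += 1
        if state.bootcount > state.bootlimit:
            # Too many unconfirmed boots: fall back to the previous slot.
            state.active, state.previous = state.previous, state.active
            state.upgrade_pending = False
            state.bootcount = 0
    return state.active


def mark_boot_successful(state: BootState) -> None:
    """Called by the running system once it considers itself healthy."""
    state.upgrade_pending = False
    state.bootcount = 0


# An update was just written to slot B and never confirms a healthy boot.
state = BootState(active="B", previous="A", upgrade_pending=True)
attempts = [select_slot(state) for _ in range(4)]
```

In this model the new slot gets three chances; the fourth boot lands back on slot A with no operator involved.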
Delta packages
Content-defined chunking (only changed blocks are transferred), deduplication across multiple source trees, compressed for modem-class bandwidth.
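To make content-defined chunking concrete, here is a toy version using a gear-style rolling hash: a chunk ends wherever the hash of recent bytes matches a bit pattern, so boundaries follow content rather than fixed offsets. The table seed, mask, and size limits are arbitrary choices for this sketch, and the platform's real chunker and package format are not shown.

```python
import random

# Fixed pseudo-random table so chunk boundaries are reproducible.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def chunk(data: bytes, mask: int = 0x3F << 20, min_size: int = 16, max_size: int = 256):
    """Split data where a rolling hash of recent bytes hits a bit pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0  # restart the hash for the next chunk
    if start < len(data):
        chunks.append(data[start:])
    return chunks


src = random.Random(1)
old = bytes(src.getrandbits(8) for _ in range(4096))
new = old[:2000] + b"patched" + old[2000:]  # small insertion mid-image

known = set(chunk(old))                      # chunks the device already has
to_send = [c for c in chunk(new) if c not in known]
```

With fixed-size blocks, the insertion would shift every later block; with content-defined boundaries, chunks before the edit are untouched and later boundaries tend to realign, so `to_send` is typically a small fraction of the image.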
Cryptographic chain
Signed manifest verified before any flash write; optional payload encryption; per-file and full-image integrity checks.
Persistent retry + reboot counters
Per-file retry limit, per-update retry limit, per-update reboot limit — survive power cycles, prevent infinite reboot loops on flaky updates.
Two transport entries
Cloud command channel and direct micro service call.
Two binaries from one library
A host-side generator and an on-device applier; format changes propagate atomically.
Why it matters. Shipping firmware is the easy part; rolling it back without a truck roll is the hard part.
Devices can subscribe to multiple update channels at once — fleet-wide, region-specific, and customer-specific can run side by side. The runtime surfaces the active channel list through the standard update status contract, so an integrator can confirm which channels a device is currently honouring.
Cloud-loop integration tests cover the five OTA scenarios that matter most in the field: happy-path, mid-flight cancellation, invalid image, download failure, and resume-on-reboot. They run on every change so an OTA pipeline cannot ship a regression without being caught.
Walk an OTA rollout with us, end to end.
Pillar 4
Remote control
A single channel addresses every micro service on the device. Live signals, live logs, live diagnostic sessions.
Cloud-callable micro services
Any registered micro service is callable from the cloud or from tooling, with JSON or protobuf passthrough, a runtime catalog query, JSON-RPC 2.0 errors, and optional per-call latency stats.
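The sketch below illustrates what a JSON-RPC 2.0 style call and error could look like for a catalog of callable methods. The error codes are the ones defined by the JSON-RPC 2.0 specification; the method name `gps.get_position`, the catalog shape, and the `dispatch` function are hypothetical, and notifications and batch requests are left out for brevity.

```python
import json

# Error codes defined by the JSON-RPC 2.0 specification.
PARSE_ERROR = -32700
INVALID_REQUEST = -32600
METHOD_NOT_FOUND = -32601


def _error(req_id, code, message):
    return json.dumps(
        {"jsonrpc": "2.0", "error": {"code": code, "message": message}, "id": req_id}
    )


def dispatch(raw: str, catalog: dict) -> str:
    """Route one JSON-RPC request to a handler looked up in the catalog."""
    try:
        req = json.loads(raw)
    except json.JSONDecodeError:
        return _error(None, PARSE_ERROR, "Parse error")
    if (
        not isinstance(req, dict)
        or req.get("jsonrpc") != "2.0"
        or not isinstance(req.get("method"), str)
    ):
        req_id = req.get("id") if isinstance(req, dict) else None
        return _error(req_id, INVALID_REQUEST, "Invalid Request")
    handler = catalog.get(req["method"])
    if handler is None:
        return _error(req.get("id"), METHOD_NOT_FOUND, "Method not found")
    result = handler(req.get("params", {}))
    return json.dumps({"jsonrpc": "2.0", "result": result, "id": req.get("id")})


# Hypothetical catalog entry, for illustration only.
catalog = {"gps.get_position": lambda params: {"lat": 0.0, "lon": 0.0}}
reply = dispatch('{"jsonrpc": "2.0", "method": "gps.get_position", "id": 1}', catalog)
```

Standard error codes let generic tooling distinguish a malformed request from a method that simply is not registered on a given device.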
Remote log retrieval
Time-range filter plus text pattern, streamed as compressed chunks. No SSH, no SCP, no device-side tooling.
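As an illustration of the filter-then-stream idea, the sketch below selects log lines by time range and regular expression and yields them as independently compressed chunks. It assumes integer timestamps and zlib compression; the real wire format, compression codec, and function name are not specified on this page.

```python
import re
import zlib


def retrieve(lines, start, end, pattern, chunk_lines=1):
    """Yield compressed chunks of the log lines matching time range and pattern."""
    rx = re.compile(pattern)
    selected = [text for ts, text in lines if start <= ts <= end and rx.search(text)]
    for i in range(0, len(selected), chunk_lines):
        block = "\n".join(selected[i:i + chunk_lines]).encode()
        yield zlib.compress(block)  # zlib is an assumption for this sketch


# Hypothetical log records as (timestamp, text) pairs.
logs = [
    (100, "boot ok"),
    (105, "modem error: timeout"),
    (110, "gps fix"),
    (115, "modem error: reset"),
    (200, "modem error: late"),
]
chunks = list(retrieve(logs, 100, 120, r"modem error"))
text = b"\n".join(zlib.decompress(c) for c in chunks).decode()
```

Filtering on the device before compressing means only matching lines cross the link, which matters on constrained bandwidth.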
On-demand observability snapshot
Pull an immediate snapshot before issuing a command or update; rate-limited.
Vehicle protocol surface via Multi Stacks
Live read-out of any CAN, CAN-FD, DoIP, ISOBUS, J1939, J1708, J1850, J1587, HD-OBD, Modbus, or RS-485 signal as a micro service topic.
Why it matters. The after-sales workflow is practical because every device on the fleet exposes the same remote-control surface.
Pillar 5
Lifecycle
Every state change is deliberate. Reboots are explained, idles are overridable, runtime upgrades roll fleet-wide without customer code changes.
Boot-reason + wake-reason registers
Exposed to every micro service (HardReset, SoftReset, WatchdogReset, Ignition, Alarm, RTC, CAN). Field tooling and the cloud surface read these to root-cause every reboot.
Override-next-idle
The registry can substitute the next natural idle with a controlled reboot or shutdown — no surprise restarts in the field.
Max-uptime guard
Two-threshold reboot model for long-running devices — prevents memory fragmentation build-up; soft limit reboots at the next idle, hard limit forces reboot, loop-guard breaks reboot-idle cycles.
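The two-threshold decision can be sketched as a small pure function. The interpretation of the loop guard below (stop triggering once the guard itself has caused several recent reboots) is a guess, as are the parameter names and the limit of three.

```python
def uptime_action(uptime_s, soft_s, hard_s, is_idle, recent_guard_reboots, loop_limit=3):
    """Decide what the max-uptime guard should do right now."""
    # Assumed loop guard: back off after too many guard-triggered reboots.
    if recent_guard_reboots >= loop_limit:
        return "none"
    if uptime_s >= hard_s:
        return "reboot_now"
    if uptime_s >= soft_s:
        return "reboot_now" if is_idle else "reboot_at_next_idle"
    return "none"


DAY = 24 * 3600
decision = uptime_action(
    uptime_s=8 * DAY, soft_s=7 * DAY, hard_s=14 * DAY,
    is_idle=False, recent_guard_reboots=0,
)
```

Past the soft limit a busy device waits for its next idle, so the reboot does not interrupt work; only the hard limit overrides that.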
System layers
Operator-pinned shared runtime layers. A fleet-wide runtime upgrade is a single digest bump in operator config; customer application containers are not rebuilt or resubmitted.
Wake-event configuration
Configurable wake mask (ignition, alarm, RTC) so the device sleeps for power and wakes for the right signals on platforms that support it.
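A wake mask is naturally expressed as a bit field, as in the sketch below. The bit values and names are invented for this example and do not reflect the platform's actual register layout.

```python
from enum import IntFlag


class Wake(IntFlag):
    # Hypothetical bit assignments, for illustration only.
    IGNITION = 1
    ALARM = 2
    RTC = 4


def should_wake(event: Wake, mask: Wake) -> bool:
    """A wake event is honoured only if its bit is set in the mask."""
    return bool(event & mask)


# Sleep through alarms, wake on ignition or a scheduled RTC tick.
mask = Wake.IGNITION | Wake.RTC
wakes_on_ignition = should_wake(Wake.IGNITION, mask)
wakes_on_alarm = should_wake(Wake.ALARM, mask)
```

Keeping the mask as configuration lets one firmware image serve deployments with different power budgets.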
Why it matters. Fleet operations is rarely about catastrophic failure; it is about coherent answers to "why did this device reboot last night?" and "how do I roll a runtime security patch to ten thousand customer containers without touching customer code?"
Cross-cutting capability
Remote Care — the fleet-uptime view
The five pillars above are how the operations layer is structured per concern. The same surface, viewed end-to-end as remediation, is the Remote Care page — built for the fleet-ops audience that thinks in immobilizations avoided, not pillars.
When the box stalls, you don't send a truck. You send a packet.
Remote Care is the cross-cutting view of the operations layer — three remediation tiers (diagnose, reconfigure, remediate), a predictive layer on the device, and an audit log on every action. Built-in OS capability; not a separate SaaS.
Catalog
The seven catalog cards behind the layer
Every claim above maps to a micro service. Drill into the catalog for interfaces, source spec, and integration touchpoints.
Platform architecture — how registry + lifecycle tie together →
Safety
Watchdog
Four-tier supervision policy micro service. 30-second autonomous loop; deterministic escalation; hardware watchdog integration.
Observability
Observer
Periodic cloud snapshot reporter. Per-component and per-process records; forced flush at lifecycle transitions; force-sync RPC.
Safety
Sentry
Issue aggregation across all micro services. Startup flood guard; burst and periodic classification; deduplication before upload.
Remote control
Logs
Remote log retrieval. Time-range and text filter; streamed compressed chunks; no SSH, no SCP.
OTA
Update
A/B partition OTA. Signed delta packages; content-defined chunking; cryptographic chain; retry and rollback counters.
Lifecycle
Containers
Lightweight container engine. System-layer substitution; Prometheus scrape per container; Linux kernel resource isolation per container.
Safety + Lifecycle
Power
Energy referee; max-uptime guard; wake-event configuration; override-next-idle; boot and wake reason registers.
Questions
The five questions integrators actually ask
What comes up on every technical discovery call.
Is this a rebrand of compliance, observability, and OTA?
No. Compliance, observability, and OTA each remain their own pages and concepts. The operations layer is the umbrella — what the platform gives a fleet on day two — and links into each of those pages where the depth lives.
Do I have to use every pillar?
No. Each pillar is independent. A team can ship with just the safety supervision and the OTA chain on day one, and turn on remote control and the observability snapshot when the cloud back end is ready.
Does this lock me into Munic-hosted cloud services?
No. The observability snapshot is Prometheus- and OpenTelemetry-compatible; the OTA chain runs against any HTTP, HTTPS, or SFTP endpoint; the remote-control surface speaks a documented JSON schema. Cloud back ends are interchangeable.
How is this different from running Prometheus + Grafana + a custom OTA service in containers?
It is not a different stack. It is the same protocols, prepackaged, with the device-side glue already in place: hardware watchdog integration, A/B partition flow, signed-package verification, lifecycle-flushed snapshots, deterministic escalation. The bring-up is what is shipped.
How do you keep the page honest as the platform evolves?
Every claim on this page links to its source micro service doc; CI fails on stale or missing source citations. The micro service catalog is the canonical truth; this page is a navigable index over it.
Operate your fleet on MOS4.
A 30-minute discovery call with engineering — walk through your day-2 cost line and see which pillars cut it the most.