Platform

The operations layer.

Five capabilities a fleet needs after the first device ships — and what it takes to keep that fleet alive in the field. Built into the platform; not a bolt-on, not a vendor add-on, not the customer's problem.

Talk to engineering

See the catalog

Watchdog supervision flow — four supervision tiers from in-process HealthMonitor to Linux hardware watchdog, illustrated as a layered cluster diagram.

Pillar 1

Observability

Live per micro service metrics, logs, traces, flame-graph profiles, and ready-made dashboards — Prometheus and OpenTelemetry compatible out of the box.

Automatic per micro service metrics

RPC timing, request and error counters, throughput, lifecycle events, status, uptime — no instrumentation code required.

Custom metric push + health score

Any micro service pushes custom metrics and a 0.0–1.0 health score via the in-process observer API.
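A minimal sketch of what a custom metric push and a clamped health score could look like once rendered in the Prometheus text exposition format. The class `ObserverSketch`, its method names, and the metric names are hypothetical stand-ins, not the platform's observer API; label escaping is left out.

```python
class ObserverSketch:
    """Hypothetical stand-in for the in-process observer API."""

    def __init__(self, service):
        self.service = service
        self.gauges = {}

    def push_metric(self, name, value):
        self.gauges[name] = float(value)

    def set_health(self, score):
        # Clamp to the 0.0-1.0 range the health score is defined on.
        self.gauges["health_score"] = min(1.0, max(0.0, float(score)))

    def render(self):
        # Prometheus text exposition: a TYPE line, then one sample per gauge.
        lines = []
        for name, value in sorted(self.gauges.items()):
            lines.append(f"# TYPE {name} gauge")
            lines.append(f'{name}{{service="{self.service}"}} {value}')
        return "\n".join(lines) + "\n"
```

A scraper that understands the Prometheus text format should be able to ingest the rendered output as-is.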

Cloud-side periodic snapshot

A dedicated reporter micro service periodically sends per-component and per-process records (CPU, memory, open files, threads, uptime, health status) and forces a flush at idle, stop, and emergency stop.

Container Prometheus scrape

Every running container's metrics endpoint feeds the supervision pipeline via the lightweight container engine.

Distributed tracing (W3C Trace Context)

Request traces flow across micro services and processes — no application code change required.
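For readers unfamiliar with W3C Trace Context, the sketch below parses a version-00 `traceparent` header and derives the header for an outgoing call. The function names are illustrative; how the platform propagates the header internally is not shown here.

```python
import re
import secrets

# Version 00 layout: 00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return (trace_id, parent_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero trace ids and parent ids are invalid per the specification.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def child_traceparent(header):
    """Keep the trace id and mint a fresh span id for the outgoing call."""
    parsed = parse_traceparent(header)
    if parsed is None:
        return None
    trace_id, _, sampled = parsed
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

Because the trace id survives every hop, a back end can stitch spans from different micro services and processes into one request trace.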

Force-sync remote pull

Cloud or tooling pulls an immediate snapshot on demand, rate-limited.
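One simple way to rate-limit an on-demand pull is a minimum interval between accepted requests. The sketch below assumes that policy; the class name, the interval, and the policy itself are illustrative rather than the platform's documented behaviour.

```python
import time


class ForceSyncLimiter:
    """Accept at most one forced snapshot per `min_interval_s` (assumed policy)."""

    def __init__(self, min_interval_s, clock=time.monotonic):
        self.min_interval_s = min_interval_s
        self.clock = clock  # injectable so the limiter can be tested offline
        self._last = None

    def try_acquire(self):
        now = self.clock()
        if self._last is not None and now - self._last < self.min_interval_s:
            return False  # too soon: the caller reports "rate limited"
        self._last = now
        return True
```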

Centralized log aggregation

Device logs stream to a central store — queryable per device and fleet-wide, alongside the same metrics, with no separate logging agent to deploy.

On-device flame-graph profiling

Sample stacks on a running device and read them back as flame graphs to see where time goes — in the field, without a special build.

Ready-made dashboards

Per-device and fleet-wide dashboards are generated from the service catalog, so a device is observable the moment it boots — no dashboard authoring.

Why it matters. A fleet device with a hundred independent micro services emits a coherent Prometheus-compatible metric set within minutes of boot — with logs, flame-graph profiles, and ready-made dashboards built in. Use the dashboards that ship with the platform, or the ones the cloud team already owns.

How the registry and observer fit in the platform architecture →

See the operational micro services →

Pillar 2

Safety

Four supervision tiers and a deterministic escalation chain. A hung task, a hung process, or a hung kernel cannot lose the device.

graph TD
      A[Micro service task] -->|pet checkpoint| B[In-process HealthMonitor]
      C[Process pings] --> D[Watchdog policy micro service]
      B --> D
      D -->|warn / restart / emergency stop| E[Registry]
      D -->|pet| F[Linux /dev/watchdog]
      F -->|hardware reset on stall| G[System reboot]

Four supervision tiers — every layer has a deterministic action.

Tier 1 — In-process HealthMonitor

Each micro service declares named checkpoints and pets them from its task loop.

Tier 2 — Process pings

Primary and worker processes exchange health pings; degraded state surfaces to the policy.

Tier 3 — Watchdog policy micro service

30-second autonomous loop, per-component violation counter, deterministic action ladder warn → restart → emergency stop.
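The action ladder can be pictured as a per-component counter mapped onto fixed thresholds. The sketch below is an assumption-laden illustration: the threshold values, the reset-on-recovery rule, and all names are hypothetical, not the policy micro service's actual configuration.

```python
from collections import defaultdict
from enum import Enum


class Action(Enum):
    NONE = "none"
    WARN = "warn"
    RESTART = "restart"
    EMERGENCY_STOP = "emergency_stop"


class WatchdogPolicySketch:
    # Threshold defaults are illustrative, not platform defaults.
    def __init__(self, warn_at=1, restart_at=3, estop_at=5):
        # Checked from the most severe rung down.
        self.ladder = [
            (estop_at, Action.EMERGENCY_STOP),
            (restart_at, Action.RESTART),
            (warn_at, Action.WARN),
        ]
        self.violations = defaultdict(int)

    def evaluate(self, component, healthy):
        """Run once per policy loop pass for each supervised component."""
        if healthy:
            self.violations[component] = 0  # assumed: recovery resets the count
            return Action.NONE
        self.violations[component] += 1
        for limit, action in self.ladder:
            if self.violations[component] >= limit:
                return action
        return Action.NONE
```

The same sequence of health results always yields the same sequence of actions, which is what makes a ladder like this testable on the bench.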

Tier 4 — Linux hardware watchdog

Timer petted at a fixed interval; if the software stack hangs, the hardware forces a reset.
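On Linux, petting the hardware watchdog amounts to writing to the watchdog device before its timeout expires. The sketch below shows that loop with a bounded number of rounds for demonstration; the interval, the round count, and the function name are assumptions, and a real supervisor would loop for as long as the stack is healthy.

```python
import os
import time


def pet_watchdog(path="/dev/watchdog", interval_s=10.0, rounds=3, sleep=time.sleep):
    """Pet the watchdog `rounds` times, `interval_s` apart; return the pet count."""
    fd = os.open(path, os.O_WRONLY)  # opening the device arms the timer
    pets = 0
    try:
        for _ in range(rounds):
            os.write(fd, b"\0")  # any write restarts the hardware countdown
            pets += 1
            sleep(interval_s)
    finally:
        # No "magic close" character is written, so on drivers that support
        # it the timer stays armed after close and a stalled stack still resets.
        os.close(fd)
    return pets
```

If the process calling this function hangs, the writes stop and the hardware resets the system once the timeout elapses.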

Emergency-stop lifecycle hook

Every micro service gets a short cleanup window before reboot. The hooks are called by the registry, not the application.

Issue aggregation

Identical issues collapse into deduplicated batches — periodic or burst-classified — before upload, so the cloud is never flooded with duplicates.
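Deduplication before upload can be sketched as grouping issues by an identity key and keeping a count plus first and last timestamps. The tuple layout and the choice of `(component, code)` as the identity key are assumptions; the periodic versus burst classification is not modelled here.

```python
def aggregate_issues(issues):
    """Collapse identical issues; `issues` holds (timestamp_s, component, code)."""
    batches = {}
    for ts, component, code in issues:
        batch = batches.setdefault(
            (component, code),
            {"component": component, "code": code, "count": 0, "first": ts, "last": ts},
        )
        batch["count"] += 1
        batch["first"] = min(batch["first"], ts)
        batch["last"] = max(batch["last"], ts)
    return list(batches.values())
```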

Energy referee

Any micro service can hold a token to defer a power transition until safe; the engine waits up to a configurable timeout for tokens to clear.
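The token mechanism can be pictured as a set of holders guarded by a condition variable: the power engine waits until the set is empty or the timeout expires. All names below are hypothetical; this is a sketch of the idea, not the energy referee's interface.

```python
import threading


class EnergyRefereeSketch:
    def __init__(self):
        self._cond = threading.Condition()
        self._tokens = set()

    def hold(self, owner):
        """A micro service asks to defer the next power transition."""
        with self._cond:
            self._tokens.add(owner)

    def release(self, owner):
        with self._cond:
            self._tokens.discard(owner)
            self._cond.notify_all()

    def wait_until_clear(self, timeout_s):
        """Return True once no tokens are held, False if the timeout expires first."""
        with self._cond:
            return self._cond.wait_for(lambda: not self._tokens, timeout=timeout_s)
```

An updater, for example, would call `hold("ota")` before a flash write and `release("ota")` afterwards, so a sleep request waits for the write to finish or for the timeout.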

Why it matters. The escalation chain is auditable, testable on the bench, and runs without operator intervention. A field outage on a fleet at scale costs more than the platform itself.

See the operational micro services →

Pillar 3

Over-the-air updates

A/B partitions, signed delta packages, automatic rollback. Pure Rust, no external daemon, runs on modem-class hardware.

A/B rootfs slots + bootcount

A failed boot reverts to the previous slot without operator action.
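A common way to implement this kind of fallback is a boot counter that the bootloader increments and userspace clears once the new image proves healthy. The sketch below assumes that scheme; the state fields, the attempt limit, and the function names are hypothetical.

```python
def select_boot_slot(state, max_attempts=3):
    """Pick the slot to boot and update `state` in place.

    `state` looks like {"active": "B", "previous": "A",
    "bootcount": 0, "confirmed": False} (field names are assumed).
    """
    if state["confirmed"]:
        return state["active"]
    if state["bootcount"] >= max_attempts:
        # Too many unconfirmed boots: fall back to the previous slot.
        state["active"], state["previous"] = state["previous"], state["active"]
        state["bootcount"] = 0
        state["confirmed"] = True
        return state["active"]
    state["bootcount"] += 1
    return state["active"]


def confirm_boot(state):
    """Called by userspace once the freshly booted image is healthy."""
    state["bootcount"] = 0
    state["confirmed"] = True
```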

Delta packages

Content-defined chunking (only changed blocks are transferred), deduplication across multiple source trees, compressed for modem-class bandwidth.
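To illustrate content-defined chunking in general terms, the sketch below cuts a byte stream wherever a gear-style rolling hash hits a bit pattern, so boundaries follow content rather than fixed offsets. The hash, the seed, and the size parameters are illustrative assumptions; the platform's actual chunker and parameters are not described on this page.

```python
import random

_rng = random.Random(0)  # fixed seed so generator and applier share one table
_GEAR = [_rng.getrandbits(64) for _ in range(256)]


def cdc_chunks(data, avg_bits=6, min_size=16, max_size=256):
    """Split `data` at content-defined boundaries using a gear rolling hash."""
    mask = (1 << avg_bits) - 1
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + _GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        size = i - start + 1
        # Cut on a hash match once the chunk is big enough, or at the size cap.
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Because cut points depend on nearby bytes only, inserting a few bytes early in a file typically shifts just the neighbouring chunks; the rest hash to the same values and need not be transferred again.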

Cryptographic chain

Signed manifest verified before any flash write; optional payload encryption; per-file and full-image integrity checks.
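The per-file integrity step can be sketched with standard-library hashing. This example assumes the manifest has already passed signature verification, which needs an asymmetric-crypto library and is out of scope here; the manifest layout (`{"sha256": {name: hex_digest}}`) is hypothetical.

```python
import hashlib
import hmac


def verify_files(manifest, files):
    """Return names whose SHA-256 differs from the already-authenticated manifest."""
    bad = []
    for name, expected_hex in manifest["sha256"].items():
        if name not in files:
            bad.append(name)  # a missing file counts as a failure
            continue
        actual_hex = hashlib.sha256(files[name]).hexdigest()
        # Constant-time comparison of the two hex digests.
        if not hmac.compare_digest(actual_hex, expected_hex):
            bad.append(name)
    return bad
```

An applier following this pattern would refuse to write anything to flash unless the returned list is empty.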

Persistent retry + reboot counters

Per-file retry limit, per-update retry limit, per-update reboot limit — survive power cycles, prevent infinite reboot loops on flaky updates.

Two transport entries

Cloud command channel and direct micro service call.

Two binaries from one library

A host-side generator and an on-device applier; format changes propagate atomically.

Why it matters. Shipping firmware is the easy part; rolling it back without a truck roll is the hard one.

Devices can subscribe to multiple update channels at once — fleet-wide, region-specific, and customer-specific can run side by side. The runtime surfaces the active channel list through the standard update status contract, so an integrator can confirm which channels a device is currently honouring.

Cloud-loop integration tests cover the five OTA scenarios that matter most in the field: happy-path, mid-flight cancellation, invalid image, download failure, and resume-on-reboot. They run on every change so an OTA pipeline cannot ship a regression without being caught.

See the operational micro services →

Walk an OTA rollout with us, end to end.

Talk to engineering

Pillar 4

Remote control

A single channel addresses every micro service on the device. Live signals, live logs, live diagnostic sessions.

Cloud-callable micro services

Any registered micro service is callable from the cloud or from tooling: JSON or protobuf passthrough, runtime catalog query, JSON-RPC 2.0 errors, and optional per-call latency stats.
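For orientation, this is what JSON-RPC 2.0 request handling and error reporting look like in miniature. The error codes are the ones defined by the JSON-RPC 2.0 specification; the `dispatch` function, the `services` dictionary, and the method names are hypothetical and do not describe the platform's RPC bridge.

```python
import json

# Error codes defined by the JSON-RPC 2.0 specification.
METHOD_NOT_FOUND = -32601
INVALID_PARAMS = -32602


def dispatch(request_json, services):
    """Route a JSON-RPC 2.0 request to a callable in `services` (assumed registry)."""
    req = json.loads(request_json)
    handler = services.get(req.get("method"))
    if handler is None:
        body = {"error": {"code": METHOD_NOT_FOUND, "message": "Method not found"}}
    else:
        try:
            body = {"result": handler(**req.get("params", {}))}
        except TypeError as exc:
            # Simplification: any TypeError is reported as bad parameters.
            body = {"error": {"code": INVALID_PARAMS, "message": str(exc)}}
    return json.dumps({"jsonrpc": "2.0", "id": req.get("id"), **body})
```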

Remote log retrieval

Time-range filter plus text pattern, streamed as compressed chunks. No SSH, no SCP, no device-side tooling.
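A time-range filter combined with a text pattern and chunked compression can be sketched as a generator. The record layout, the chunk size, and the use of zlib are assumptions for illustration; the actual wire format is defined in the catalog entry.

```python
import re
import zlib


def retrieve_logs(records, start_ts, end_ts, pattern, chunk_lines=2):
    """Yield zlib-compressed chunks of matching lines within [start_ts, end_ts].

    `records` is an iterable of (timestamp, line) pairs (assumed layout).
    """
    rx = re.compile(pattern)
    batch = []
    for ts, line in records:
        if start_ts <= ts <= end_ts and rx.search(line):
            batch.append(line)
            if len(batch) == chunk_lines:
                yield zlib.compress("\n".join(batch).encode("utf-8"))
                batch = []
    if batch:  # flush the final partial chunk
        yield zlib.compress("\n".join(batch).encode("utf-8"))
```

Streaming chunk by chunk keeps memory use bounded on the device and lets the receiver start decoding before the full range has been scanned.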

On-demand observability snapshot

Pull an immediate snapshot before issuing a command or update; rate-limited.

Vehicle protocol surface via Multi Stacks

Live read-out of any CAN, CAN-FD, DoIP, ISOBUS, J1939, J1708, J1850, J1587, HD-OBD, Modbus, RS-485 signal as a micro service topic.

Why it matters. The after-sales workflow is practical because every device on the fleet exposes the same remote-control surface.

Read the remote-diagnostics page →

RPC bridge and SDK documentation →

Pillar 5

Lifecycle

Every state change is deliberate. Reboots are explained, idles are overridable, runtime upgrades roll fleet-wide without customer code changes.

Boot-reason + wake-reason registers

Exposed to every micro service (HardReset, SoftReset, WatchdogReset, Ignition, Alarm, RTC, CAN). Field tooling and the cloud surface read these to root-cause every reboot.

Override-next-idle

The registry can substitute the next natural idle with a controlled reboot or shutdown — no surprise restarts in the field.

Max-uptime guard

Two-threshold reboot model for long-running devices — prevents memory fragmentation build-up; soft limit reboots at the next idle, hard limit forces reboot, loop-guard breaks reboot-idle cycles.
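The two-threshold decision reduces to a small pure function. The sketch below covers only the soft and hard limits; the loop guard is left out because its exact rules are not described here, and the function name and return values are hypothetical.

```python
def uptime_decision(uptime_s, is_idle, soft_limit_s, hard_limit_s):
    """Return "none", "reboot_at_next_idle", or "reboot_now"."""
    if uptime_s >= hard_limit_s:
        return "reboot_now"  # hard limit: reboot regardless of activity
    if uptime_s >= soft_limit_s:
        # Soft limit: wait for a natural idle, or use the current one.
        return "reboot_now" if is_idle else "reboot_at_next_idle"
    return "none"
```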

System layers

Operator-pinned shared runtime layers. A fleet-wide runtime upgrade is a single digest bump in operator config; customer application containers are not rebuilt or resubmitted.

Wake-event configuration

Configurable wake mask (ignition, alarm, RTC) so the device sleeps for power and wakes for the right signals on platforms that support it.

Why it matters. Fleet operations is rarely about catastrophic failure; it is about coherent answers to "why did this device reboot last night?" and "how do I roll a runtime security patch to ten thousand customer containers without touching customer code?".

Compliance and lifecycle posture →

Config-driven ops policies with MEP →

Cross-cutting capability

Remote Care — the fleet-uptime view

The five pillars above are how the operations layer is structured per concern. The same surface, viewed end-to-end as remediation, is the Remote Care page — built for the fleet-ops audience that thinks in immobilizations avoided, not pillars.

When the box stalls, you don't send a truck. You send a packet.

Remote Care is the cross-cutting view of the operations layer — three remediation tiers (diagnose, reconfigure, remediate), a predictive layer on the device, and an audit log on every action. Built-in OS capability; not a separate SaaS.

Read Remote Care →

Catalog

The seven catalog cards behind the layer

Every claim above maps to a micro service. Drill into the catalog for interfaces, source spec, and integration touchpoints.

Platform architecture — how registry + lifecycle tie together →

Safety

Watchdog

Four-tier supervision policy micro service. 30-second autonomous loop; deterministic escalation; hardware watchdog integration.

Catalog →

Observability

Observer

Periodic cloud snapshot reporter. Per-component and per-process records; forced flush at lifecycle transitions; force-sync RPC.

Catalog →

Safety

Sentry

Issue aggregation across all micro services. Startup flood guard; burst and periodic classification; deduplication before upload.

Catalog →

Remote control

Logs

Remote log retrieval. Time-range and text filter; streamed compressed chunks; no SSH, no SCP.

Catalog →

OTA

Update

A/B partition OTA. Signed delta packages; content-defined chunking; cryptographic chain; retry and rollback counters.

Catalog →

Lifecycle

Containers

Lightweight container engine. System-layer substitution; Prometheus scrape per container; Linux kernel resource isolation per container.

Catalog →

Safety + Lifecycle

Power

Energy referee; max-uptime guard; wake-event configuration; override-next-idle; boot and wake reason registers.

Catalog →

See all seven cards in the catalog →

Questions

The five questions integrators actually ask

What comes up on every technical discovery call.

Is this a rebrand of compliance, observability, and OTA?

No. Compliance, observability, and OTA each remain their own pages and concepts. The operations layer is the umbrella (what the platform gives a fleet on day two) and links into each of those pages where the depth lives.

Do I have to use every pillar?

No. Each pillar is independent. A team can ship with just the safety supervision and the OTA chain on day one, and turn on remote control and the observability snapshot when the cloud back end is ready.

Does this lock me into Munic-hosted cloud services?

No. The observability snapshot is Prometheus- and OpenTelemetry-compatible; the OTA chain runs against any HTTP, HTTPS, or SFTP endpoint; the remote-control surface speaks a documented JSON schema. Cloud back ends are interchangeable.

How is this different from running Prometheus + Grafana + a custom OTA service in containers?

It is not a different stack. It is the same protocols, prepackaged, with the device-side glue already in place: hardware watchdog integration, A/B partition flow, signed-package verification, lifecycle-flushed snapshots, deterministic escalation. The bring-up is what is shipped.

How do you keep the page honest as the platform evolves?

Every claim on this page links to its source micro service doc; CI fails on stale or missing source citations. The micro service catalog is the canonical truth; this page is a navigable index over it.

Operate your fleet on MOS4.

A 30-minute discovery call with engineering — walk through your day-2 cost line and see which pillars cut it the most.

Talk to engineering

See the catalog
