Platform · AI Funnel

Describe your AI flow in TOML.

One TOML graph, one model, one OTA. Cloud Connect retrains, optimizes, and validates against reference hardware; the device dispatches the DAG and runs camera → GPU → NPU with zero pixel-copy.
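
A minimal sketch of what such a graph could look like. The schema below is illustrative only, invented for this example — it is not the shipped funnel.toml format:

```toml
# Hypothetical funnel.toml sketch — illustrative schema, not the real format.
[source]
input = "mipi-csi0"            # camera entry: produces a shared GPU frame

[triage]
model = "triage.tflite"        # unified triage detector, runs on every frame

[[route]]
label = "vehicle"              # triage label this route matches
model = "plate_reader.tflite"  # downstream NPU model for the cropped ROI

[[route]]
label = "person"
sink = "customer-container"    # DAG terminal node: exit to customer code
```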

AI · intelligence layer
~100 TOPS · AI-class silicon, QCS family, binary-compatible with QCS6490
5 camera inputs · MIPI-CSI, GMSL2, UVC, RTSP, WebRTC
0 CPU pixel reads · entire hot path, camera to NPU
Four-stage funnel narrowing left to right: ingest, preprocess, infer, publish. Amber accent on the inference diamond.

Deployment lifecycle

From TOML to first labelled frame.

Three phases. Build runs in Cloud Connect: triage retrain, NPU-only optimization, hardware-in-the-loop validation. Provision runs on device when the bundle lands: DAG dispatched, models loaded, GPU and NPU configured. Run executes per frame: triage detects, ai-funnel routes, the customer container terminates the flow.

1 — Build · Cloud Connect

Customer ships, Cloud Connect validates.

One TOML graph plus the model and a COCO dataset. Cloud Connect retrains the unified triage detector across all customer funnels, enforces NPU-only operations, quantizes against the target NPU, and gates the result on reference hardware. Output is a signed deployment bundle. The registry is Munic-hosted or customer-hosted.

Build phase of the AI funnel deployment lifecycle, inside Cloud Connect. The customer container — funnel.toml, models, COCO dataset, and container code — is pushed to a registry that is either Munic-hosted or customer-hosted. From the registry, the artefacts feed the triage trainer, which merges multiple customer funnels into one unified triage detector. The result feeds the NPU optimizer, which enforces NPU-only operations and INT8 quantization. The optimized model feeds the hardware-in-the-loop validator, which runs the candidate against reference target hardware and gates on accuracy and latency. A bundle that passes is signed and emitted as the deployment bundle.

flowchart LR
  CB["Customer container<br/>funnel.toml + models<br/>+ COCO dataset + container code"]
  REG[("Registry<br/>Munic-hosted or<br/>customer-hosted")]
  TT["Triage trainer<br/>multiple funnels → one triage"]
  OPT["NPU optimizer<br/>NPU-only ops · INT8 quant"]
  HIL["HW-in-the-loop validator<br/>reference target hardware<br/>accuracy + latency gate"]
  BUNDLE["Signed deployment bundle"]
  CB --> REG --> TT --> OPT --> HIL --> BUNDLE
  class TT,OPT,HIL ai-node
Build phase. AI-class steps are amber: the triage trainer (multiple funnels merged into one detector), the NPU optimizer (NPU-only ops, INT8 quantization), and the hardware-in-the-loop validator that gates on accuracy and latency.

A bundle that fails the hardware-in-the-loop accuracy or latency gate is rejected back to the customer; only signed, gated bundles reach the OTA channel.

2 — Provision · Device

DAG dispatched. Pipeline armed.

When the bundle lands on the device, the DAG compiled from funnel.toml is dispatched: camera and sensor routing configured, models loaded into the NPU, the GPU shader configured for tensor format, letterboxing, and ROI extraction. Model updates ride the same OTA channel as code, with fleet rollback, staged rollout, and version pinning.

Provision phase of the AI funnel deployment lifecycle, on the device. The signed bundle is pulled through the same OTA channel as code updates. The DAG compiled from funnel.toml is dispatched, and three configurations fan out in parallel: camera and sensor routing, the AI runtime loading the models into the NPU, and the GPU ROI shader configuring tensor format, letterboxing, and ROI extraction. All three converge at a ready signal — the device is armed for the first frame.

flowchart LR
  PULL["Bundle pulled<br/>same OTA channel as code"]
  DAG["DAG dispatched<br/>compiled from funnel.toml"]
  CFG_CAM["Camera / sensor<br/>routing configured"]
  CFG_NPU["AI runtime<br/>models loaded into NPU"]
  CFG_GPU["GPU ROI shader<br/>tensor format · letterbox · ROI"]
  READY["Ready · first frame"]
  PULL --> DAG
  DAG --> CFG_CAM --> READY
  DAG --> CFG_NPU --> READY
  DAG --> CFG_GPU --> READY
  class CFG_NPU ai-node
Provision phase. The bundle pull and DAG dispatch are system steps; mos-ai-runtime loading the model into the NPU is the AI-class step (amber).

3 — Run · Per frame

Triage routes; the DAG terminates.

Each frame goes camera → GPU triage tensor → NPU triage. ai-funnel inspects the labels and routes per detection: the GPU re-crops the ROI, the NPU runs the downstream model, the result either loops back to the router for the next stage of the DAG or exits to the customer container — tracking, map, cloud message, CAN — depending on the DAG terminal node.

Run phase of the AI funnel deployment lifecycle, per frame on the device. The camera or sensor produces a shared GPU frame. The GPU produces a triage tensor. The NPU triage model emits labels and bounding boxes to the AI funnel DAG router. The router dispatches per detection: the GPU re-crops the region of interest with letterboxing and normalization, the NPU runs the downstream model and returns labels and confidence. The router either loops back to chain another model in the DAG or exits to the customer container, where the workload terminates as tracking, mapping, a cloud message, or a CAN message — the terminal node depends on the DAG.

flowchart LR
  SRC["Camera / sensor<br/>shared GPU frame"]
  GPU1["GPU<br/>triage tensor"]
  NPU1["NPU triage<br/>labels + bboxes"]
  AIF["AI funnel<br/>DAG router"]
  GPU2["GPU<br/>per-detection ROI<br/>letterbox · normalize"]
  NPU2["NPU model<br/>labels + confidence"]
  SINK["Customer container<br/>tracking · map · cloud message · CAN"]
  SRC --> GPU1 --> NPU1 --> AIF
  AIF --> GPU2 --> NPU2 --> AIF
  AIF --> SINK
  class NPU1,NPU2 ai-node
Run phase. NPU inference steps are amber. The router-to-model-back-to-router edges show the DAG loopback; the final edge to the customer container is the DAG terminal.
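
The per-frame loop can be sketched in a few lines of Python. The names here (`Detection`, `run_model`, the dict-shaped DAG) are illustrative assumptions, not the real ai-funnel API:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    bbox: tuple  # (x, y, w, h) in frame coordinates

def route_frame(triage_detections, dag, run_model, sink):
    """Dispatch each detection through the DAG until a terminal node."""
    work = list(triage_detections)
    while work:
        det = work.pop()
        downstream = dag.get(det.label)
        if downstream is None:
            sink(det)  # DAG terminal: exit to the customer container
        else:
            # GPU re-crops the ROI, NPU runs the downstream model (stubbed)
            work.append(run_model(downstream, det))
```

A label with no downstream entry exits straight to the sink; any other detection loops back through the router — the loopback edge in the diagram.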

Shared-memory pipeline

Camera → GPU → NPU. No copies.

The frame handle travels between camera, GPU, and NPU. The pixel data stays in place. No CPU pixel reads in the hot path.

Shared-memory pipeline sequence. The camera publishes a frame in shared GPU memory and passes the handle across process boundaries. The frame transport relays the handle to the GPU ROI shader, which imports the frame and runs crop, resize, and normalize compute shaders entirely on the GPU. The output tensor handle is handed to the AI runtime, which drives the silicon-vendor NPU delegate; on Qualcomm, GPU-to-NPU shared memory (rpcmem/ION) is reused so the NPU does not re-import. The AI runtime returns selection metadata — bounding boxes and model IDs, tens of bytes, no pixels — back to the GPU ROI shader.

sequenceDiagram
  participant Cam as Camera
  participant Frm as Frame transport
  participant Roi as GPU ROI shader
  participant Ai as AI runtime
  Cam->>Frm: shared GPU frame
  Frm-->>Roi: frame handle
  Note over Roi: GPU import · crop · resize · normalize
  Roi-->>Ai: tensor handle (shared GPU-NPU memory on Qualcomm)
  Note over Ai: NPU inference via vendor delegate
  Ai-->>Roi: SelectRois(bbox + model_id) — tens of bytes, no pixels
No CPU pixel reads in the hot path. The frame handle moves; the pixel data stays in place.

01 — Capture

Camera capture

Five inputs behind one service API. Produces a shared GPU frame regardless of backend. Same entry contract across MIPI-CSI, GMSL2, USB UVC, RTSP, and WebRTC.

02 — GPU crop and resize

GPU ROI shader

Crop, resize, and normalize run entirely on the GPU. A CPU pixel read here is a design bug — a rule, not a target. Portable across iMX8M Plus and Qualcomm; Qualcomm additionally uses GPU-to-NPU shared memory.
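
The letterbox arithmetic the shader applies can be written down directly — a minimal sketch with illustrative names, assuming a uniform scale-to-fit with symmetric padding:

```python
def letterbox_params(src_w, src_h, dst_w, dst_h):
    """Uniform scale plus symmetric padding — the arithmetic a letterbox
    stage applies when fitting a camera frame to a model input."""
    scale = min(dst_w / src_w, dst_h / src_h)    # preserve aspect ratio
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x = (dst_w - new_w) // 2                 # horizontal padding
    pad_y = (dst_h - new_h) // 2                 # vertical padding
    return scale, new_w, new_h, pad_x, pad_y
```

For a 1920×1080 frame into a 640×640 model input, the scale is 1/3, the resized frame is 640×360, and 140 rows of padding land above and below.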

03 — Inference

AI runtime

Receives the tensor handle. Drives the silicon-vendor NPU delegate on each target. ONNX auto-conversion is available so teams using ONNX export paths can supply either format.

Frame transport

One frame, many consumers — each at its own rate.

One publisher, many subscribers, independent cadences. The same bus carries video frames and inference tensors with identical zero-copy semantics. A slow consumer never stalls the producer.

60 Hz

Video frames

Camera publishes at full frame rate.

30 Hz

GPU crop

The GPU ROI shader consumes at the inference rate without blocking the camera.

10 Hz

Pose tracking · other consumers

Any Python, C++, Rust, Go, or Lua container can pull frames from the same bus through the MQTT bridge — no SDK adoption.

One-to-many fanout. The camera publishes a single shared GPU frame stream. The same bus carries it to three subscribers running on independent cadences — a 60 Hz video consumer (dashcam encoder), a 30 Hz GPU crop consumer (the ROI shader), and a 10 Hz pose tracker. The ROI shader feeds its tensor output to the AI runtime for NPU inference. A slow consumer never stalls the producer.

flowchart TD
  P[Camera<br/>shared GPU frame publisher]
  P --> S60[60 Hz video<br/>dashcam encoder]
  P --> S30[30 Hz GPU crop<br/>ROI shader]
  P --> S10[10 Hz pose tracker]
  S30 --> NPU[AI runtime<br/>NPU inference]
  class NPU ai-node
  class S30 ai-node
One publisher, N subscribers, each on its own cadence. A slow consumer cannot stall the producer.
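
The drop-oldest semantics can be sketched with bounded queues. `FrameBus` and its methods are illustrative, not the real frame-transport API:

```python
from collections import deque

# Minimal sketch of non-blocking fanout: each subscriber owns a bounded
# queue, so a slow consumer drops stale frame handles instead of stalling
# the publisher. Only handles move; pixel data stays in shared memory.
class FrameBus:
    def __init__(self):
        self.subs = {}

    def subscribe(self, name, depth=2):
        self.subs[name] = deque(maxlen=depth)  # full deque evicts the oldest
        return self.subs[name]

    def publish(self, frame_handle):
        for q in self.subs.values():
            q.append(frame_handle)             # never blocks the producer
```

A depth of 2 means a 10 Hz consumer of a 60 Hz stream always sees the freshest frames rather than a growing backlog.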

AI-class audiences

From drones to vehicles.

Quad-rotor drone seen from below — four down-facing cameras each showing a tiled ground patch. Amber on the central tile.

AI-class silicon targets any platform where a camera feeds an NPU — aerial drones, ADAS cameras, DMS units, and autonomous vehicles. The same zero-copy pipeline applies across all.

Camera inputs

Five inputs. One shared-frame contract.

Camera inputs and silicon backends
| Input | Silicon target | Backend | Entry contract |
|---|---|---|---|
| MIPI-CSI | iMX8M Plus | libcamera / V4L2 | shared GPU frame |
| GMSL2 | Qualcomm QCS only | V4L2 subdev + QMMF | shared GPU frame |
| USB UVC | All targets | V4L2 | shared GPU frame |
| RTSP / ONVIF | All targets | GStreamer | shared GPU frame |
| WebRTC | All targets | GStreamer | shared GPU frame |

GMSL2 is a Qualcomm-only path today. iMX8M Plus uses MIPI-CSI direct.

GDPR anonymisation

Active from the first frame.

Face and plate blur runs as a shader stage before any frame leaves the pipeline. The policy is set at boot time and is active from the first frame on the device.

What is anonymised

  • Faces — bounding-box region blur before frame hand-off
  • Licence plates — same shader stage, same shared-memory path

Deployment constraint

The policy is set at boot time. A runtime hot-toggle without reboot is not available today. Enable once per device configuration; the frame plane enforces it continuously.

Observability

14 built-in metrics across the pipeline.

GPU ROI shader

  • calls_total
  • errors_total
  • rois_extracted_total
  • frame_fetch_failures_total
  • frame_fetch_timeouts_total
  • duration_seconds (histogram)

AI runtime

  • inference_requests_total
  • inference_errors_total
  • model_load_duration_seconds
  • inference_duration_seconds (histogram)
  • active_models (gauge)
  • watchdog_heartbeat_budget_seconds
  • plus error dedup

All at zero per-component author cost — emitted by the framework automatically.
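
The zero-author-cost claim amounts to the framework wrapping every component entry point in instrumentation roughly like this — a hedged Python sketch, not the actual implementation:

```python
import time
from functools import wraps

# Framework-side instrumentation sketch: the wrapper emits the calls,
# errors, and duration metrics so component authors write none of it.
METRICS = {"calls_total": 0, "errors_total": 0, "duration_seconds": []}

def instrumented(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        METRICS["calls_total"] += 1
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        except Exception:
            METRICS["errors_total"] += 1
            raise
        finally:
            METRICS["duration_seconds"].append(time.perf_counter() - start)
    return wrapper
```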

Test doubles

Develop without hardware.

Provider-published fakes for CI and off-target development
| Fake component | Test scope | CI-runnable |
|---|---|---|
| AI runtime fake | Model inference stub | Yes — no NPU hardware |
| GPU ROI shader fake | GPU ROI extraction stub | Yes — no GPU required |
| Mock frame source / sink | Frame bus pub/sub | Yes — pure host |
| Sim file source | Camera input replay | Yes — file-based |
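
In a CI test, the AI-runtime fake stands in for NPU inference with canned detections. `FakeAiRuntime` below is an illustrative stand-in, not the published fake's API:

```python
# AI-runtime fake sketch: canned detections instead of NPU inference, so
# funnel logic runs on a plain CI host with no accelerator attached.
class FakeAiRuntime:
    def __init__(self, canned):
        self.canned = canned      # tensor id -> canned detection list
        self.loaded = []

    def load_model(self, name):
        self.loaded.append(name)  # no NPU: loading is bookkeeping only

    def infer(self, tensor_id):
        return self.canned.get(tensor_id, [])
```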

FAQ

Frequently asked questions

  • Do pixel bytes ever cross the CPU in the hot path?

    No, by design. Camera, GPU, and NPU exchange the same shared-memory frame via handle passing — no buffer copies at any step.

  • Which model formats does the AI runtime accept?

    TFLite directly. ONNX is auto-converted in the pipeline so teams using ONNX export paths can supply either format.

  • Can I run multiple models concurrently?

    Yes. The AI runtime loads multiple models simultaneously. Same-model re-entry is prevented by construction. Serialisation mode (per-model or global single-lane) is configurable without recompilation.

  • How does GDPR live anonymisation work?

    Face and plate blur runs as a shader stage before any frame leaves the pipeline. The policy is set at boot time — active from the first frame. Live hot-toggle without reboot is not available today.

  • Can I develop without NPU hardware?

    Yes. Provider-published fakes (AI runtime, GPU ROI shader, frame source/sink, file replay) let you develop and integration-test the full pipeline without NPU hardware, a camera sensor, or Bazel.

Architecture FAQ

Implementation details

  • How does the zero-copy hand-off work technically?

    NV12 dmabufs from the camera component cross the process boundary via SCM_RIGHTS fd passing. The GPU ROI shader imports them into Vulkan via VK_EXT_external_memory_dma_buf and runs WGSL crop, resize, and normalize compute shaders entirely on the GPU. The output tensor dmabuf handle is handed to the AI runtime, which drives the TFLite C API with the silicon-vendor delegate.

  • Is GPU-to-NPU shared memory available on iMX8M Plus?

    No. GPU-to-NPU shared memory via rpcmem/ION (QnnMem_register) is Qualcomm-specific. iMX8M Plus uses GPU ROI extraction without rpcmem/ION. Both targets share the same Vulkan ROI pipeline.
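
The fd-passing hand-off can be demonstrated with Python's `socket.send_fds`/`recv_fds` (3.9+, Unix only). A pipe stands in for the NV12 dmabuf in this sketch:

```python
import os
import socket

# Handle-passing sketch: the producer shares a buffer by sending its file
# descriptor over a Unix socket (SCM_RIGHTS underneath), so the consumer
# reads the same kernel buffer — no pixel bytes cross the message itself.
producer, consumer = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

r_fd, w_fd = os.pipe()
os.write(w_fd, b"NV12 frame bytes")            # producer fills the buffer
socket.send_fds(producer, [b"frame"], [r_fd])  # pass the handle, not pixels

msg, fds, _, _ = socket.recv_fds(consumer, 1024, 1)
payload = os.read(fds[0], 1024)                # consumer reads the same buffer
```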

Bring your model and dataset.

Show us the inference task; engineering will walk through the capture-to-NPU path on a target device.