Platform · AI Funnel
Describe your AI flow in TOML.
One TOML graph, one model, one OTA. Cloud Connect retrains, optimizes, and validates against reference hardware; the device dispatches the DAG and runs camera → GPU → NPU with zero pixel-copy.
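As a picture of what that graph could look like, here is a minimal, hypothetical funnel.toml; the table and key names are illustrative assumptions, not the shipping schema.

```toml
# Hypothetical funnel.toml sketch; tables and keys are illustrative,
# not the shipping schema.

[source]
input = "mipi-csi"               # camera backend; the frame lands in shared GPU memory

[triage]
model = "models/triage.tflite"   # unified detector, retrained by Cloud Connect

[[node]]
name = "plate_read"
on_label = "vehicle"             # triage detection that routes into this node
model = "models/plate_ocr.tflite"
next = "sink"                    # or another node, to chain the DAG

[sink]
kind = "cloud-message"           # terminal node: tracking, map, cloud message, or CAN
```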
Deployment lifecycle
From TOML to first labelled frame.
Three phases. Build runs in Cloud Connect: triage retrain, NPU-only optimization, hardware-in-the-loop validation. Provision runs on device when the bundle lands: DAG dispatched, models loaded, GPU and NPU configured. Run executes per frame: triage detects, ai-funnel routes, the customer container terminates the flow.
1 — Build · Cloud Connect
Customer ships, Cloud Connect validates.
One TOML graph plus the model and a COCO dataset. Cloud Connect retrains the unified triage detector across all customer funnels, enforces NPU-only operations, quantizes against the target NPU, and gates the result on reference hardware. Output is a signed deployment bundle. The registry is Munic-hosted or customer-hosted.
Build phase of the AI funnel deployment lifecycle, inside Cloud Connect. The customer container — funnel.toml, models, COCO dataset, and container code — is pushed to a registry that is either Munic-hosted or customer-hosted. From the registry, the artefacts feed the triage trainer, which merges multiple customer funnels into one unified triage detector. The result feeds the NPU optimizer, which enforces NPU-only operations and INT8 quantization. The optimized model feeds the hardware-in-the-loop validator, which runs the candidate against reference target hardware and gates on accuracy and latency. A bundle that passes is signed and emitted as the deployment bundle.
```mermaid
flowchart LR
CB["Customer container<br/>funnel.toml + models<br/>+ COCO dataset + container code"]
REG[("Registry<br/>Munic-hosted or<br/>customer-hosted")]
TT["Triage trainer<br/>multiple funnels → one triage"]
OPT["NPU optimizer<br/>NPU-only ops · INT8 quant"]
HIL["HW-in-the-loop validator<br/>reference target hardware<br/>accuracy + latency gate"]
BUNDLE["Signed deployment bundle"]
CB --> REG --> TT --> OPT --> HIL --> BUNDLE
class TT,OPT,HIL ai-node
```

A bundle that fails the hardware-in-the-loop accuracy or latency gate is rejected back to the customer; only signed, gated bundles reach the OTA channel.
↓ OTA · same channel as code updates
2 — Provision · Device
DAG dispatched. Pipeline armed.
When the bundle lands on the device, the DAG compiled from funnel.toml is dispatched: camera and sensor routing configured, models loaded into the NPU, the GPU shader configured for tensor format, letterboxing, and ROI extraction. Model updates ride the same OTA channel as code, with fleet rollback, staged rollout, and version pinning.
Provision phase of the AI funnel deployment lifecycle, on the device. The signed bundle is pulled through the same OTA channel as code updates. The DAG compiled from funnel.toml is dispatched, and three configurations fan out in parallel: camera and sensor routing, the AI runtime loading the models into the NPU, and the GPU ROI shader configuring tensor format, letterboxing, and ROI extraction. All three converge at a ready signal — the device is armed for the first frame.
```mermaid
flowchart LR
PULL["Bundle pulled<br/>same OTA channel as code"]
DAG["DAG dispatched<br/>compiled from funnel.toml"]
CFG_CAM["Camera / sensor<br/>routing configured"]
CFG_NPU["AI runtime<br/>models loaded into NPU"]
CFG_GPU["GPU ROI shader<br/>tensor format · letterbox · ROI"]
READY["Ready · first frame"]
PULL --> DAG
DAG --> CFG_CAM --> READY
DAG --> CFG_NPU --> READY
DAG --> CFG_GPU --> READY
class CFG_NPU ai-node
```
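To make the update controls concrete (fleet rollback, staged rollout, version pinning), an OTA policy block could take a shape like this; the `[ota]` table and its keys are assumptions, not the shipping schema.

```toml
# Hypothetical OTA policy sketch; table and key names are assumptions.
[ota]
channel = "stable"          # same channel as code updates
pin = "1.4.2"               # version pinning: hold the fleet on a known bundle
staged_rollout = 0.10       # ship to 10% of the fleet first
rollback_on_failure = true  # fleet rollback if a health gate trips
```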
↓ first frame
3 — Run · Per frame
Triage routes; the DAG terminates.
Each frame goes camera → GPU triage tensor → NPU triage. ai-funnel inspects the labels and routes per detection: the GPU re-crops the ROI, the NPU runs the downstream model, the result either loops back to the router for the next stage of the DAG or exits to the customer container — tracking, map, cloud message, CAN — depending on the DAG terminal node.
Run phase of the AI funnel deployment lifecycle, per frame on the device. The camera or sensor produces a shared GPU frame. The GPU produces a triage tensor. The NPU triage model emits labels and bounding boxes to the AI funnel DAG router. The router dispatches per detection: the GPU re-crops the region of interest with letterboxing and normalization, the NPU runs the downstream model and returns labels and confidence. The router either loops back to chain another model in the DAG or exits to the customer container, where the workload terminates as tracking, mapping, a cloud message, or a CAN message — the terminal node depends on the DAG.
```mermaid
flowchart LR
SRC["Camera / sensor<br/>shared GPU frame"]
GPU1["GPU<br/>triage tensor"]
NPU1["NPU triage<br/>labels + bboxes"]
AIF["AI funnel<br/>DAG router"]
GPU2["GPU<br/>per-detection ROI<br/>letterbox · normalize"]
NPU2["NPU model<br/>labels + confidence"]
SINK["Customer container<br/>tracking · map · cloud message · CAN"]
SRC --> GPU1 --> NPU1 --> AIF
AIF --> GPU2 --> NPU2 --> AIF
AIF --> SINK
class NPU1,NPU2 ai-node
```
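A sketch of how a loop-back chain might be declared, under the same hypothetical schema as above; node names and keys are illustrative.

```toml
# Hypothetical two-stage chain; node names and keys are illustrative.
[[node]]
name = "plate_detect"
on_label = "vehicle"                 # triage label that enters the chain
model = "models/plate_detect.tflite"
next = "plate_read"                  # result loops back through the router

[[node]]
name = "plate_read"
model = "models/plate_ocr.tflite"
next = "sink"                        # exits to the customer container

[sink]
kind = "can-message"                 # the terminal node depends on the DAG
```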
Shared-memory pipeline
Camera → GPU → NPU. No copies.
The frame handle travels between camera, GPU, and NPU. The pixel data stays in place. No CPU pixel reads in the hot path.
Shared-memory pipeline sequence. The camera publishes a frame in shared GPU memory and passes the handle across process boundaries. The frame transport relays the handle to the GPU ROI shader, which imports the frame and runs crop, resize, and normalize compute shaders entirely on the GPU. The output tensor handle is handed to the AI runtime, which drives the silicon-vendor NPU delegate; on Qualcomm, GPU-to-NPU shared memory (rpcmem/ION) is reused so the NPU does not re-import. The AI runtime returns selection metadata — bounding boxes and model IDs, tens of bytes, no pixels — back to the GPU ROI shader.
```mermaid
sequenceDiagram
participant Cam as Camera
participant Frm as Frame transport
participant Roi as GPU ROI shader
participant Ai as AI runtime
Cam->>Frm: shared GPU frame
Frm-->>Roi: frame handle
Note over Roi: GPU import · crop · resize · normalize
Roi-->>Ai: tensor handle (shared GPU-NPU memory on Qualcomm)
Note over Ai: NPU inference via vendor delegate
Ai-->>Roi: SelectRois(bbox + model_id) — tens of bytes, no pixels
```
01 — Capture
Camera capture
Five inputs behind one service API. Produces a shared GPU frame regardless of backend. Same entry contract across MIPI-CSI, GMSL2, USB UVC, RTSP, and WebRTC.
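As an illustration, selecting a backend could be a one-block configuration change; the `[source]` keys below are assumptions, not the shipping schema.

```toml
# Hypothetical source declarations; any backend yields the same shared GPU frame.
[source]
input = "rtsp"
url = "rtsp://198.51.100.7/stream1"   # example address

# Switching backend changes only this block, not the downstream DAG:
# input = "usb-uvc"
# device = "/dev/video0"
```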
02 — GPU crop and resize
GPU ROI shader
Crop, resize, and normalize run entirely on the GPU. A CPU pixel read in the hot path is treated as a design bug: a rule, not a target. The pipeline is portable across iMX8M Plus and Qualcomm; Qualcomm additionally uses GPU-to-NPU shared memory.
03 — Inference
AI runtime
Receives the tensor handle. Drives the silicon-vendor NPU delegate on each target. ONNX auto-conversion is available so teams using ONNX export paths can supply either format.
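For example, a model entry might carry its format explicitly; the keys below are hypothetical.

```toml
# Hypothetical model entry; format and path keys are illustrative.
[[node]]
name = "pose"
model = "models/pose.onnx"    # auto-converted in the pipeline
format = "onnx"               # or "tflite" to skip conversion
```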
Frame transport
One frame, many consumers — each at its own rate.
One publisher, many subscribers, independent cadences. The same bus carries video frames and inference tensors with identical zero-copy semantics. A slow consumer never stalls the producer.
60 Hz
Video frames
Camera publishes at full frame rate.
30 Hz
GPU crop
The GPU ROI shader consumes at the inference rate without blocking the camera.
10 Hz
Pose tracking · other consumers
Any Python, C++, Rust, Go, or Lua container can pull frames from the same bus through the MQTT bridge — no SDK adoption.
One-to-many fanout. The camera publishes a single shared GPU frame stream. The same bus carries it to three subscribers running on independent cadences — a 60 Hz video consumer (dashcam encoder), a 30 Hz GPU crop consumer (the ROI shader), and a 10 Hz pose tracker. The ROI shader feeds its tensor output to the AI runtime for NPU inference. A slow consumer never stalls the producer.
```mermaid
flowchart TD
P[Camera<br/>shared GPU frame publisher]
P --> S60[60 Hz video<br/>dashcam encoder]
P --> S30[30 Hz GPU crop<br/>ROI shader]
P --> S10[10 Hz pose tracker]
S30 --> NPU[AI runtime<br/>NPU inference]
class NPU ai-node
class S30 ai-node
```
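One way to picture the independent cadences is per-subscriber rate declarations; the `[[subscriber]]` table and `rate_hz` key are assumptions, not the shipping schema.

```toml
# Hypothetical subscriber declarations; one publisher, independent rates.
[[subscriber]]
name = "dashcam_encoder"
rate_hz = 60                # full camera rate

[[subscriber]]
name = "roi_shader"
rate_hz = 30                # inference rate; never blocks the camera

[[subscriber]]
name = "pose_tracker"
rate_hz = 10                # pulls via the MQTT bridge, no SDK adoption
```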
AI-class audiences
From drones to vehicles.
AI-class silicon targets any platform where a camera feeds an NPU — aerial drones, ADAS cameras, DMS units, and autonomous vehicles. The same zero-copy pipeline applies across all.
Camera inputs
Five inputs. One shared-frame contract.
| Input | Silicon target | Backend | Entry contract |
|---|---|---|---|
| MIPI-CSI | iMX8M Plus | libcamera / V4L2 | shared GPU frame |
| GMSL2 | Qualcomm QCS only | V4L2 subdev + QMMF | shared GPU frame |
| USB UVC | All targets | V4L2 | shared GPU frame |
| RTSP / ONVIF | All targets | GStreamer | shared GPU frame |
| WebRTC | All targets | GStreamer | shared GPU frame |
GMSL2 is a Qualcomm-only path today; iMX8M Plus uses MIPI-CSI directly.
GDPR anonymisation
Active from the first frame.
Face and plate blur runs as a shader stage before any frame leaves the pipeline. The policy is set at boot time and is active from the first frame on the device.
What is anonymised
- Faces — bounding-box region blur before frame hand-off
- Licence plates — same shader stage, same shared-memory path
Deployment constraint
The policy is set at boot time. A runtime hot-toggle without reboot is not available today. Enable once per device configuration; the frame plane enforces it continuously.
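A boot-time policy of this shape could be written as follows; the `[anonymisation]` table and its keys are illustrative, not the shipping schema.

```toml
# Hypothetical boot-time anonymisation policy; read once at boot, no hot-toggle.
[anonymisation]
enabled = true
blur = ["faces", "licence_plates"]   # shader stage before any frame leaves the pipeline
```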
Observability
14 built-in metrics across the pipeline.
GPU ROI shader
- calls_total
- errors_total
- rois_extracted_total
- frame_fetch_failures_total
- frame_fetch_timeouts_total
- duration_seconds (histogram)
AI runtime
- inference_requests_total
- inference_errors_total
- model_load_duration_seconds
- inference_duration_seconds (histogram)
- active_models (gauge)
- watchdog_heartbeat_budget_seconds
- plus error dedup
All at zero per-component authoring cost — every metric is emitted by the framework automatically.
Test doubles
Develop without hardware.
| Fake component | Test scope | CI-runnable |
|---|---|---|
| AI runtime fake | Model inference stub | Yes — no NPU hardware |
| GPU ROI shader fake | GPU ROI extraction stub | Yes — no GPU required |
| Mock frame source / sink | Frame bus pub/sub | Yes — pure host |
| Sim file source | Camera input replay | Yes — file-based |
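Swapping the fakes in could be a configuration switch of roughly this shape; the `[test]` keys are hypothetical and mirror the table above.

```toml
# Hypothetical test wiring; runs in CI with no NPU, GPU, or camera.
[test]
ai_runtime = "fake"          # model inference stub
gpu_roi_shader = "fake"      # ROI extraction stub
frame_source = "sim-file"    # replay a recorded capture
sim_file = "captures/drive_01.raw"   # example path
```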
FAQ
Frequently asked questions
- Do pixel bytes ever cross the CPU in the hot path?
  No, by design. Camera, GPU, and NPU exchange the same shared-memory frame via handle passing — no buffer copies at any step.
- Which model formats does the AI runtime accept?
  TFLite directly. ONNX is auto-converted in the pipeline so teams using ONNX export paths can supply either format.
- Can I run multiple models concurrently?
  Yes. The AI runtime loads multiple models simultaneously. Same-model re-entry is prevented by construction. Serialisation mode (per-model or global single-lane) is configurable without recompilation; see the sketch after this list.
- How does GDPR live anonymisation work?
  Face and plate blur runs as a shader stage before any frame leaves the pipeline. The policy is set at boot time — active from the first frame. Live hot-toggle without reboot is not available today.
- Can I develop without NPU hardware?
  Yes. Provider-published fakes (AI runtime, GPU ROI shader, frame source/sink, file replay) let you develop and integration-test the full pipeline without NPU hardware, a camera sensor, or Bazel.
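A sketch of the serialisation setting referenced in the concurrency answer above, assuming a hypothetical `[runtime]` table; keys are illustrative.

```toml
# Hypothetical concurrency settings; changed in configuration, not by recompiling.
[runtime]
serialisation = "per-model"   # or "global" for a single inference lane
max_models = 4                # loaded simultaneously; same-model re-entry is prevented
```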
Architecture FAQ
Implementation details
- How does the zero-copy hand-off work technically?
  NV12 dmabufs from the camera component cross the process boundary via SCM_RIGHTS fd passing. The GPU ROI shader imports them into Vulkan via VK_EXT_external_memory_dma_buf and runs WGSL crop, resize, and normalize compute shaders entirely on the GPU. The output tensor dmabuf handle is handed to the AI runtime, which drives the TFLite C API with the silicon-vendor delegate.
- Is GPU-to-NPU shared memory available on iMX8M Plus?
  No. GPU-to-NPU shared memory via rpcmem/ION (QnnMem_register) is Qualcomm-specific. iMX8M Plus uses GPU ROI extraction without rpcmem/ION. Both targets share the same Vulkan ROI pipeline.
Bring your model and dataset.
Show us the inference task; engineering will walk through the capture-to-NPU path on a target device.