Platform · AI Language

On-device LLM, retrieval and voice.

MOS4 AI Language runs a quantised large language model (LLM), a grounded retrieval-augmented generation (RAG) pipeline, and multilingual voice recognition entirely on-device. Cloud is opt-in. Every answer carries a full provenance record.

AI · intelligence layer

~280 MB on-device RAM: acceptance criterion for SmolLM2 360M (4-bit quantised) on compute-class silicon

≤ 15% WER (Word Error Rate) at 70–75 dB: acceptance criterion for voice recognition at industrial noise levels

6 months audit retention: per-answer provenance record, rolling window

Pillar 1

On-device language model, offline-first.

A quantised small language model runs entirely on-device. Cloud access is opt-in behind three independent gates, so core inference has no cloud dependency.

Reference model: SmolLM2 360M (4-bit quantised)

Runs on compute-class silicon with a RAM footprint of approximately 280 MB, which is the acceptance criterion for the reference configuration. Larger or smaller models can be substituted without code changes, subject to RAM and latency budgets.
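
As a rough sanity check on the footprint figure, the weights-only size of a 4-bit model can be worked out from the parameter count. The sketch below takes "360M" at face value and treats everything beyond the raw weights as an unexplained remainder; the source does not break the 280 MB figure down, and real quantisation formats add scale factors and keep some tensors at higher precision, so the weights figure is a lower bound.

```python
# Back-of-envelope estimate only; all constants come from the text above.
NOMINAL_PARAMS = 360_000_000   # "360M" taken at face value
BITS_PER_WEIGHT = 4            # 4-bit quantisation
FOOTPRINT_MB = 280             # approximate RAM footprint quoted above

# Raw weight storage in megabytes (1 MB = 1,000,000 bytes).
weights_mb = NOMINAL_PARAMS * BITS_PER_WEIGHT / 8 / 1_000_000

# Whatever is left of the quoted footprint for runtime state and overhead.
headroom_mb = FOOTPRINT_MB - weights_mb

print(f"weights: {weights_mb:.0f} MB, remaining budget: {headroom_mb:.0f} MB")
```

The same arithmetic gives a first estimate when a larger or smaller model is substituted.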

Three-gate cloud opt-in

Cloud fallback requires explicit configuration at three independent layers: the system prompt must permit cloud escalation, the tool allow-list must include the cloud connector, and the device-level network policy must permit outbound requests. All three gates must be open. Default posture is offline.
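
The all-gates-open rule can be pictured as a simple conjunction. This is a minimal sketch, not the product's configuration API: the `CloudGates` class and its field names are hypothetical and exist only to illustrate that each layer defaults to closed and that any single closed gate keeps the device offline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CloudGates:
    """Hypothetical snapshot of the three opt-in layers; all default to closed."""
    prompt_permits_escalation: bool = False       # system prompt layer
    connector_in_allow_list: bool = False         # tool allow-list layer
    network_policy_allows_outbound: bool = False  # device network policy layer


def cloud_allowed(gates: CloudGates) -> bool:
    """Cloud fallback is permitted only when every gate is open."""
    return (
        gates.prompt_permits_escalation
        and gates.connector_in_allow_list
        and gates.network_policy_allows_outbound
    )


# Default posture: nothing configured, so the answer is offline.
print(cloud_allowed(CloudGates()))  # False
```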

Figure: cross-section of a micro service showing the inference layer isolated from cloud connectors, with the cloud gate drawn as a separate module.

Pillar 2

Documented RAG (Retrieval-Augmented Generation) with refuse gate.

Grounded retrieval with a calibrated similarity threshold. Below threshold, the system refuses to answer rather than hallucinate.

Cosine similarity thresholds

Top-1 chunk: ≥ 0.55. Top-3 mean: ≥ 0.45. Both are acceptance criteria calibrated against a 50-question benchmark. Thresholds are configurable per deployment.
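
The two thresholds combine into a single pass/fail check. The sketch below assumes both conditions must hold and that fewer than three retrieved chunks counts as a refusal; the function name and the handling of short result lists are illustrative assumptions, while the two threshold values come from the text above.

```python
from statistics import mean

TOP1_MIN = 0.55       # top-1 chunk threshold from the text
TOP3_MEAN_MIN = 0.45  # top-3 mean threshold from the text


def should_answer(similarities: list[float]) -> bool:
    """Return True when retrieval clears both thresholds.

    `similarities` holds cosine scores for the retrieved chunks, in any order.
    Fewer than three hits is treated as a refusal (an assumption).
    """
    ranked = sorted(similarities, reverse=True)
    if len(ranked) < 3:
        return False
    return ranked[0] >= TOP1_MIN and mean(ranked[:3]) >= TOP3_MEAN_MIN
```

A strong top-1 hit with weak neighbours, for example `[0.56, 0.30, 0.30]`, still refuses under this reading because the top-3 mean falls below 0.45.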

Refuse gate

Acceptance criterion: ≥ 80% refuse rate on out-of-corpus questions and ≤ 10% false-refuse rate on in-corpus questions. "I do not have information on that" is the correct answer when retrieval falls below threshold.
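
Both rates can be computed from a labelled benchmark run. This is a sketch under the assumption that each benchmark result is recorded as an `(in_corpus, refused)` pair; that data shape and the function names are hypothetical.

```python
def refuse_rates(results: list[tuple[bool, bool]]) -> tuple[float, float]:
    """Return (out-of-corpus refuse rate, in-corpus false-refuse rate).

    Each item is (in_corpus, refused) for one benchmark question.
    """
    out_refused = [refused for in_corpus, refused in results if not in_corpus]
    in_refused = [refused for in_corpus, refused in results if in_corpus]
    if not out_refused or not in_refused:
        raise ValueError("benchmark needs both in-corpus and out-of-corpus questions")
    return sum(out_refused) / len(out_refused), sum(in_refused) / len(in_refused)


def meets_criteria(results: list[tuple[bool, bool]]) -> bool:
    """Check the two acceptance thresholds stated above."""
    refuse_rate, false_refuse_rate = refuse_rates(results)
    return refuse_rate >= 0.80 and false_refuse_rate <= 0.10
```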

BGE-small embedding model

The reference embedding model is BGE-small, which runs on-device. The vector index is built from customer-supplied documents during the integration phase and updated via OTA when the corpus changes.

Pillar 3

Four-layer prompt-injection defence.

Acceptance criterion: ≥ 95% deflection rate on a 20-prompt red-team test suite, which means at least 19 of the 20 prompts must be deflected. Defence is layered across input, system prompt, corpus, and tool allow-list.

Input sanitiser — event-rule action

User input passes through a configurable event-rule action that strips known injection patterns before the text reaches the language model. Deny-list is hot-reloadable without restart.
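
A deny-list sanitiser of this kind can be sketched with regular expressions. The patterns below are illustrative placeholders, not the shipped deny-list, and the function names are hypothetical; hot reload is modelled simply as recompiling the list from fresh configuration.

```python
import re

# Illustrative deny-list only; real patterns would come from operator configuration.
DENY_PATTERNS = [
    r"ignore (all |any )?(previous|prior) instructions",
    r"reveal (the |your )?system prompt",
]


def compile_deny_list(patterns: list[str]) -> list[re.Pattern]:
    """Compile patterns case-insensitively; call again to pick up a new list."""
    return [re.compile(p, re.IGNORECASE) for p in patterns]


def sanitise(text: str, deny_list: list[re.Pattern]) -> str:
    """Strip every deny-list match, then collapse leftover whitespace."""
    for pattern in deny_list:
        text = pattern.sub("", text)
    return " ".join(text.split())
```

Pattern stripping only catches known phrasings, which is why the text above treats it as one of four layers rather than a complete defence.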

System-prompt fencing

The system prompt template is cryptographically sealed at deployment time. Attempts to override or append to it via user input are blocked at the inference layer.

Corpus-bake block-list

During corpus build, content that matches a configurable block-list is excluded from the vector index, so poisoned passages that match the list never reach retrieval and cannot be used for indirect injection.
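
In its simplest form, the build-time filter drops any chunk that contains a block-listed phrase before it is embedded. The sketch assumes plain case-insensitive substring matching; the actual matching rules are not described in the source, and `filter_chunks` is a hypothetical name.

```python
def filter_chunks(chunks: list[str], block_list: list[str]) -> list[str]:
    """Keep only chunks that contain none of the block-listed phrases."""
    lowered = [phrase.lower() for phrase in block_list]
    return [
        chunk for chunk in chunks
        if not any(phrase in chunk.lower() for phrase in lowered)
    ]
```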

Tool allow-list

The model can only call tools that are explicitly listed in the operator allow-list. No tool escalation is possible without an operator configuration change. Default list is minimal.
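
The allow-list check sits between the model's tool request and the tool itself. A minimal sketch, assuming tools are plain callables held in a registry; `call_tool`, `ToolNotAllowed`, and the registry shape are hypothetical names for illustration.

```python
from typing import Any, Callable


class ToolNotAllowed(Exception):
    """Raised when the model requests a tool outside the operator allow-list."""


def call_tool(
    name: str,
    args: dict[str, Any],
    registry: dict[str, Callable[..., Any]],
    allow_list: frozenset[str],
) -> Any:
    """Run a tool only if the operator has listed it; otherwise refuse."""
    if name not in allow_list:
        raise ToolNotAllowed(name)
    return registry[name](**args)
```

Using an immutable `frozenset` for the allow-list reflects the point made above: the set of callable tools changes only through an operator configuration change, never at the model's request.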

Pillar 4

Per-answer audit manifest.

Every answer emits a full provenance record on the EventBus. Six-month rolling retention. Designed as evidence input for EU AI Act Article 10 and Article 13 obligations.

EventBus topic: `audit.answer.manifest.{session_id}`

Each answer publishes: chunk IDs used in retrieval, document paths, model version, cosine similarity scores, refusal reason (when applicable), and a timestamp. Provenance is complete and machine-readable.
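
One way to picture the record is as a small structured object serialised to JSON. The field list and topic pattern follow the text above, but the field names, the JSON encoding, and the example values used with this class are assumptions rather than the published schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class AnswerManifest:
    """Hypothetical shape of one per-answer provenance record."""
    session_id: str
    chunk_ids: list[str]
    document_paths: list[str]
    model_version: str
    similarity_scores: list[float]
    refusal_reason: Optional[str] = None  # set only when the answer was refused
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def topic(self) -> str:
        """EventBus topic, following the pattern given above."""
        return f"audit.answer.manifest.{self.session_id}"

    def payload(self) -> str:
        """Machine-readable JSON body for the record."""
        return json.dumps(asdict(self), sort_keys=True)
```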

Six-month rolling retention

Records are retained for six months by default and are accessible via the observability stack. Retention window is configurable. Records are structured for export to compliance tooling.
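
A rolling window amounts to discarding records older than a cutoff. The sketch below approximates six months as 183 days and assumes each record carries a timezone-aware `timestamp`; both the day count and the record shape are assumptions.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=183)  # roughly six months; exact day count is an assumption


def prune(
    records: list[dict],
    now: datetime,
    retention: timedelta = RETENTION,
) -> list[dict]:
    """Drop records whose `timestamp` falls outside the retention window."""
    cutoff = now - retention
    return [r for r in records if r["timestamp"] >= cutoff]
```

Passing a different `retention` value models the configurable window mentioned above.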

Figure: timeline showing audit events stamped at each answer, with the six-month retention window indicated on the right.

See EU AI Act posture for the full compliance mapping.

Pillar 5

Multilingual industrial voice.

Whisper-tiny multilingual with mandatory speech-to-text (STT) vocabulary boost. Acceptance criterion: WER (Word Error Rate) ≤ 15% at 70–75 dB factory noise floor.
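
WER is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference transcript. The sketch below implements that standard definition; how the acceptance test tokenises and normalises transcripts is not stated in the source, so whitespace splitting is an assumption.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

One wrong word in a seven-word utterance gives a WER of about 14.3%, which sits just inside the 15% criterion.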

Whisper-tiny on-device

Speech recognition runs on-device with Whisper-tiny multilingual. No cloud STT dependency. Audio never leaves the device by default.

Mandatory speech recognition vocabulary boost

Domain-specific terms — part numbers, process codes, product names — are injected into the speech-to-text (STT) vocabulary at deployment time. Recognition of customer-specific vocabulary is an integration requirement, not an option.

Streaming text-to-speech (TTS) with Piper

Text-to-speech (TTS) uses Piper for on-device synthesis. First-audio latency target is approximately 200–300 ms. Piper supports multiple languages and voices without a cloud dependency.

Figure: three-stage AI Funnel pipeline. The customer provides TOML, an ONNX/TFLite model, and a dataset; Munic cloud retrains, quantises, and validates; the on-device runtime fuses NPU and GPU with shared memory.

Explore further

Related capabilities.

AI Funnel — visual intelligence engine

Declare your vision AI pipeline in TOML. Cloud Connect retrains, packages, and OTAs. Camera to NPU with no CPU pixel copies.

AI Vision — camera and pose tracking

Five camera inputs, GPU crop and resize, NPU inference on AI-class silicon, and visual-inertial pose tracking. The visual intelligence sibling.

Compliance · CRA and EU AI Act

CRA vulnerability handling, RED radio compliance, SBOM, and the EU AI Act posture covering audit manifest evidence and threat-model gates.

Hardware — silicon tiers

AI Language runs on compute-class and AI-class silicon. See the hardware page for tier definitions, form factors, and connectivity options.

SDK — six-language developer surface

Extend AI Language with custom event-rule actions, tool plug-ins, and retrieval corpus builders using the six-language SDK including Lua 5.4.

Kiosk solution

Voice-first kiosk with grounded answers, refuse gate, and a per-answer audit manifest. AI Language as the platform for a complete vertical solution.

FAQ

Frequently asked questions

Does the LLM require a cloud connection?

No. The LLM runs entirely on-device by default. Cloud access is opt-in behind three independent gates and must be explicitly configured. Offline operation is the default posture.

How does the retrieval refuse gate work?

Each retrieval-augmented generation (RAG) query checks cosine similarity against the indexed corpus: the top-1 chunk must reach ≥ 0.55 and the top-3 mean must reach ≥ 0.45. If either threshold is missed, the system answers "I do not have information on that" rather than hallucinating. Both thresholds are calibrated against a 50-question benchmark.

What voice accuracy can I expect in industrial environments?

The acceptance criterion is WER (Word Error Rate) ≤ 15% at 70–75 dB background noise, which is typical of a factory floor. A mandatory speech-to-text (STT) vocabulary boost helps domain-specific terms (part numbers, process codes) to be recognised accurately. The WER figure is an acceptance criterion for the integration, not a production guarantee for every deployment.

How does the audit manifest support EU AI Act compliance?

Every answer emits a full provenance record on the EventBus: chunk IDs, document paths, model version, similarity scores, and refusal reason where applicable. Records are retained for six months by default. This record can serve as evidence input for EU AI Act Article 10 (data governance) and Article 13 (transparency) obligations. See the compliance page for the full posture.

Which open-source models are supported?

The reference configuration uses a 4-bit quantised SmolLM2 360M — approximately 280 MB on-device RAM footprint. Larger or smaller quantised models can be substituted subject to RAM and latency requirements. BGE-small is the default embedding model for RAG retrieval.

Bring your voice-AI use case.

Show us the domain vocabulary and the noise environment — engineering will walk through the RAG configuration, vocabulary boost, and audit setup for your deployment.
