Platform · AI Language
On-device LLM, retrieval and voice.
MOS4 AI Language runs a quantised large language model (LLM), a grounded retrieval-augmented generation (RAG) pipeline, and multilingual voice recognition entirely on-device. Cloud is opt-in. Every answer carries a full provenance record.
Pillar 1
On-device language model, offline-first.
A quantised small language model (LLM) runs entirely on-device. Cloud access is opt-in behind three independent gates. No cloud dependency for core inference.
Reference model: SmolLM2 360M (4-bit quantised)
Runs on compute-class silicon with approximately 280 MB RAM footprint. Larger or smaller models can be substituted without code changes — subject to RAM and latency budgets. This is an acceptance criterion for the reference configuration.
Three-gate cloud opt-in
Cloud fallback requires explicit configuration at three independent layers: the system prompt must permit cloud escalation, the tool allow-list must include the cloud connector, and the device-level network policy must permit outbound requests. All three gates must be open. Default posture is offline.
Pillar 2
Documented RAG (Retrieval-Augmented Generation) with refuse gate.
Grounded retrieval with a calibrated similarity threshold. Below threshold, the system refuses to answer rather than hallucinate.
Cosine similarity thresholds
Top-1 chunk: ≥ 0.55. Top-3 mean: ≥ 0.45. Both are acceptance criteria calibrated against a 50-question benchmark. Thresholds are configurable per deployment.
Refuse gate
Acceptance criterion: ≥ 80% refuse rate on out-of-corpus questions, ≤ 10% false-refuse on in-corpus questions. "I do not have information on that" is the correct answer when retrieval fails.
BGE-small embedding model
The reference embedding model is BGE-small, which runs on-device. The vector index is built from customer-supplied documents during the integration phase and updated via OTA when the corpus changes.
Pillar 3
Four-layer prompt-injection defence.
Acceptance criterion: ≥ 95% deflection rate on a 20-prompt red-team test suite. Defence is layered across input, system prompt, corpus, and tool allow-list.
Input sanitiser — event-rule action
User input passes through a configurable event-rule action that strips known injection patterns before the text reaches the language model. Deny-list is hot-reloadable without restart.
System-prompt fencing
The system prompt template is cryptographically sealed at deployment time. Attempts to override or append to it via user input are blocked at the inference layer.
Corpus-bake block-list
During corpus build, content that matches a configurable block-list is excluded from the vector index. Indirect injection via poisoned documents cannot reach retrieval.
Tool allow-list
The model can only call tools that are explicitly listed in the operator allow-list. No tool escalation is possible without an operator configuration change. Default list is minimal.
Pillar 4
Per-answer audit manifest.
Every answer emits a full provenance record on the EventBus. Six-month rolling retention. Designed as evidence input for EU AI Act §10 and §13 obligations.
EventBus topic: audit.answer.manifest.{session_id}
Each answer publishes: chunk IDs used in retrieval, document paths, model version, cosine similarity scores, refusal reason (when applicable), and a timestamp. Provenance is complete and machine-readable.
Six-month rolling retention
Records are retained for six months by default and are accessible via the observability stack. Retention window is configurable. Records are structured for export to compliance tooling.
See EU AI Act posture for the full compliance mapping.
Pillar 5
Multilingual industrial voice.
Whisper-tiny multilingual with mandatory speech-to-text (STT) vocabulary boost. Acceptance criterion: WER (Word Error Rate) ≤ 15% at 70–75 dB factory noise floor.
Whisper-tiny on-device
Speech recognition runs on-device with Whisper-tiny multilingual. No cloud STT dependency. Audio never leaves the device by default.
Mandatory speech recognition vocabulary boost
Domain-specific terms — part numbers, process codes, product names — are injected into the speech-to-text (STT) vocabulary at deployment time. Recognition of customer-specific vocabulary is an integration requirement, not an option.
Streaming text-to-speech (TTS) with Piper
Text-to-speech (TTS) uses Piper for on-device synthesis. First-audio latency target is approximately 200–300 ms. Piper supports multiple languages and voices without a cloud dependency.
Explore further
Related capabilities.
AI Funnel — visual intelligence engine
Declare your vision AI pipeline in TOML. Cloud Connect retrains, packages, and OTAs. Camera to NPU with no CPU pixel copies.
AI Vision — camera and pose tracking
Five camera inputs, GPU crop and resize, NPU inference on AI-class silicon, and visual-inertial pose tracking. The visual intelligence sibling.
Compliance · CRA and EU AI Act
CRA vulnerability handling, RED radio compliance, SBOM, and the EU AI Act posture covering audit manifest evidence and threat-model gates.
Hardware — silicon tiers
AI Language runs on compute-class and AI-class silicon. See the hardware page for tier definitions, form factors, and connectivity options.
SDK — six-language developer surface
Extend AI Language with custom event-rule actions, tool plug-ins, and retrieval corpus builders using the six-language SDK including Lua 5.4.
Kiosk solution
Voice-first kiosk with grounded answers, refuse gate, and a per-answer audit manifest. AI Language as the platform for a complete vertical solution.
FAQ
Frequently asked questions
-
Does the LLM require a cloud connection?
No. The LLM runs entirely on-device by default. Cloud access is opt-in behind three independent gates and must be explicitly configured. Offline operation is the default posture.
-
How does the retrieval refuse gate work?
Each retrieval query (RAG — Retrieval-Augmented Generation) checks cosine similarity against the indexed corpus: the top-1 chunk must reach ≥ 0.55, and the top-3 mean must reach ≥ 0.45. Below threshold the system answers "I do not have information on that" rather than hallucinating. The thresholds are calibrated against a 50-question benchmark.
-
What voice accuracy can I expect in industrial environments?
The acceptance criterion is WER (Word Error Rate) ≤ 15% at 70–75 dB background noise — typical factory floor levels. A mandatory speech-to-text (STT) vocabulary boost ensures domain-specific terms (part numbers, process codes) are recognised accurately. This is an acceptance criterion for the integration, not a production guarantee for every deployment.
-
How does the audit manifest support EU AI Act compliance?
Every answer emits a full provenance record on the EventBus: chunk IDs, document paths, model version, similarity scores, and refusal reason where applicable. Records are retained for six months by default. This record can serve as evidence input for EU AI Act §10 (data governance) and §13 (transparency) obligations. See the compliance page for the full posture.
-
Which open-source models are supported?
The reference configuration uses a 4-bit quantised SmolLM2 360M — approximately 280 MB on-device RAM footprint. Larger or smaller quantised models can be substituted subject to RAM and latency requirements. BGE-small is the default embedding model for RAG retrieval.
Bring your voice-AI use case.
Show us the domain vocabulary and the noise environment — engineering will walk through the RAG configuration, vocabulary boost, and audit setup for your deployment.