Skip to content

Diff: services/service-input

From 3f1e0da to 3f1e0da

+0 / −0 lines
BeforeAfter
--- ---
schema: foundry-doc-v1 schema: foundry-doc-v1
title: "service-input" title: "service-input"
slug: service-input slug: service-input
short_description: "service-input is the Ring 1 document ingest service that accepts files at the per-tenant boundary, routes them through format-specific parsers, and writes normalized output to the WORM ledger via service-fs." short_description: "service-input is the Ring 1 document ingest service that accepts files at the per-tenant boundary, routes them through format-specific parsers, and writes normalized output to the WORM ledger via service-fs."
category: services category: services
type: topic type: topic
status: active status: active
bcsc_class: public-disclosure-safe bcsc_class: public-disclosure-safe
last_edited: 2026-05-25 last_edited: 2026-05-25
editor: pointsav-engineering editor: pointsav-engineering
paired_with: service-input.es.md paired_with: service-input.es.md
--- ---
`service-input` is the [[three-ring-architecture|Ring 1]] document-intake service in the PointSav system architecture. It accepts files at the per-tenant boundary, routes them through format-specific parsers, and writes the normalized output into the per-tenant [[worm-ledger-design|WORM Immutable Ledger]] via [[service-fs-architecture|`service-fs`]]. `service-input` is the [[three-ring-architecture|Ring 1]] document-intake service in the PointSav system architecture. It accepts files at the per-tenant boundary, routes them through format-specific parsers, and writes the normalized output into the per-tenant [[worm-ledger-design|WORM Immutable Ledger]] via [[service-fs-architecture|`service-fs`]].
## The Anchor position ## The Anchor position
The Ring 1 boundary is the trust perimeter of the system. Every artifact that crosses this boundary must pass through a deterministic, auditable processing step before it is committed to the WORM ledger. `service-input` occupies that position for structured documents. The Ring 1 boundary is the trust perimeter of the system. Every artifact that crosses this boundary must pass through a deterministic, auditable processing step before it is committed to the WORM ledger. `service-input` occupies that position for structured documents.
[[architecture-decisions|SYS-ADR-10]] designates this position The Anchor (function F12 on [[os-console-platform|`os-console`]]): the single, mandatory intake point where document bytes enter Ring 1, undergo deterministic parsing, and are handed off to `service-fs` for permanent append-only storage. Nothing about parsing is delegated to an AI model; nothing about the output schema varies non-deterministically. The same input bytes, on the same parser version, always produce the same `ParsedDocument`. [[architecture-decisions|SYS-ADR-10]] designates this position The Anchor (function F12 on [[os-console-platform|`os-console`]]): the single, mandatory intake point where document bytes enter Ring 1, undergo deterministic parsing, and are handed off to `service-fs` for permanent append-only storage. Nothing about parsing is delegated to an AI model; nothing about the output schema varies non-deterministically. The same input bytes, on the same parser version, always produce the same `ParsedDocument`.
This is the Anchor property: the document's presence in the WORM ledger is anchored to a reproducible deterministic transformation, not to a probabilistic AI inference. This is the Anchor property: the document's presence in the WORM ledger is anchored to a reproducible deterministic transformation, not to a probabilistic AI inference.
## Architecture ## Architecture
The service is structured as an [[mcp-substrate-protocol|MCP]] handler that receives document bytes via `POST /mcp`, dispatches them through a format detector and per-format parser, and forwards the parsed result to `service-fs` via its FsClient. The HTTP boundary exposes three endpoints: `GET /healthz`, `GET /readyz`, and `POST /mcp`. The MCP endpoint is the only write surface. There is no read surface in this service; reads happen at `service-fs` via its own MCP resource interface. The service is structured as an [[mcp-substrate-protocol|MCP]] handler that receives document bytes via `POST /mcp`, dispatches them through a format detector and per-format parser, and forwards the parsed result to `service-fs` via its FsClient. The HTTP boundary exposes three endpoints: `GET /healthz`, `GET /readyz`, and `POST /mcp`. The MCP endpoint is the only write surface. There is no read surface in this service; reads happen at `service-fs` via its own MCP resource interface.
## Format detection ## Format detection
Format detection runs before parser dispatch. The `detect_format` function applies two strategies in order. Format detection runs before parser dispatch. The `detect_format` function applies two strategies in order.
Extension-first matching checks the filename extension against the known set: `.pdf`, `.md` or `.markdown`, `.docx`, `.xlsx`. Extension matching is case-insensitive. Extension-first matching checks the filename extension against the known set: `.pdf`, `.md` or `.markdown`, `.docx`, `.xlsx`. Extension matching is case-insensitive.
Magic-byte fallback: when the extension is absent or unrecognized, the first four bytes of the document are inspected for known magic sequences. PDF documents begin with `%PDF` (hex `25 50 44 46`). DOCX and XLSX documents are ZIP files sharing the `PK` header (`50 4B 03 04`). Magic-byte fallback: when the extension is absent or unrecognized, the first four bytes of the document are inspected for known magic sequences. PDF documents begin with `%PDF` (hex `25 50 44 46`). DOCX and XLSX documents are ZIP files sharing the `PK` header (`50 4B 03 04`).
Format detection returns an `Option<Format>`. A `None` result causes the MCP handler to return a `-32602 Invalid params` error to the caller; the file is not written to the ledger. This detection algorithm is entirely deterministic — no AI inference — per SYS-ADR-07. Format detection returns an `Option<Format>`. A `None` result causes the MCP handler to return a `-32602 Invalid params` error to the caller; the file is not written to the ledger. This detection algorithm is entirely deterministic — no AI inference — per SYS-ADR-07.
## Parsers ## Parsers
Four format-specific parsers are registered at daemon startup: Four format-specific parsers are registered at daemon startup:
**PDF** — uses oxidize-pdf 2.x. Writes the input bytes to a temporary file, opens it via oxidize-pdf, extracts all page text, and deletes the temporary file via an RAII drop guard. The temporary file is never left on disk after the method returns. **PDF** — uses oxidize-pdf 2.x. Writes the input bytes to a temporary file, opens it via oxidize-pdf, extracts all page text, and deletes the temporary file via an RAII drop guard. The temporary file is never left on disk after the method returns.
**Markdown** — uses pulldown-cmark 0.12. Operates on the in-memory byte slice directly. Collects all `Text` and `Code` events from the pulldown-cmark event stream, strips HTML tags, and concatenates results. **Markdown** — uses pulldown-cmark 0.12. Operates on the in-memory byte slice directly. Collects all `Text` and `Code` events from the pulldown-cmark event stream, strips HTML tags, and concatenates results.
**DOCX** — uses docx-rust 0.1.x. Confirms the magic-byte signature before attempting ZIP extraction. Extracts paragraph text with newline separators. **DOCX** — uses docx-rust 0.1.x. Confirms the magic-byte signature before attempting ZIP extraction. Extracts paragraph text with newline separators.
**XLSX** — uses calamine 0.34. Opens the workbook from an in-memory `Cursor`, iterates all sheets, and serializes all rows as tab-separated columns. **XLSX** — uses calamine 0.34. Opens the workbook from an in-memory `Cursor`, iterates all sheets, and serializes all rows as tab-separated columns.
All four parsers return a normalized `ParsedDocument` with a `format` field, a `source_id` passed through from the caller, `text` as extracted text content, and a `metadata` JSON object including a `"parser"` key. All four parsers return a normalized `ParsedDocument` with a `format` field, a `source_id` passed through from the caller, `text` as extracted text content, and a `metadata` JSON object including a `"parser"` key.
## MCP interface ## MCP interface
The MCP endpoint accepts JSON-RPC 2.0 requests via `POST /mcp` with a required `X-Foundry-Module-ID` header. The `document.ingest` tool accepts `filename` (used for format detection), `source_id` (the caller-side document identifier), and `bytes_base64` (base64-encoded document bytes). The MCP endpoint accepts JSON-RPC 2.0 requests via `POST /mcp` with a required `X-Foundry-Module-ID` header. The `document.ingest` tool accepts `filename` (used for format detection), `source_id` (the caller-side document identifier), and `bytes_base64` (base64-encoded document bytes).
On success, the response includes the `cursor` — a monotonically increasing integer assigned by service-fs — as the stable ledger reference for this entry. On error, `-32602 Invalid params` indicates a format detection failure, and `-32603 Internal error` indicates a parser failure or service-fs transport error. On success, the response includes the `cursor` — a monotonically increasing integer assigned by service-fs — as the stable ledger reference for this entry. On error, `-32602 Invalid params` indicates a format detection failure, and `-32603 Internal error` indicates a parser failure or service-fs transport error.
## ADR-07 compliance ## ADR-07 compliance
[[architecture-decisions|ADR-07]] prohibits AI inference in Ring 1. `service-input` maintains this constraint throughout: format detection uses extension matching followed by magic-byte inspection; parsing uses purpose-built libraries that apply deterministic algorithms; text normalization is purely structural. A consequence of this constraint is that `service-input` does not attempt to extract meaning from the parsed text. Semantic interpretation is the responsibility of [[service-extraction|Ring 2 services]] downstream. [[architecture-decisions|ADR-07]] prohibits AI inference in Ring 1. `service-input` maintains this constraint throughout: format detection uses extension matching followed by magic-byte inspection; parsing uses purpose-built libraries that apply deterministic algorithms; text normalization is purely structural. A consequence of this constraint is that `service-input` does not attempt to extract meaning from the parsed text. Semantic interpretation is the responsibility of [[service-extraction|Ring 2 services]] downstream.
## Deployment configuration ## Deployment configuration
| Variable | Required | Default | Description | | Variable | Required | Default | Description |
|---|---|---|---| |---|---|---|---|
| `INPUT_MODULE_ID` | Yes | — | Per-tenant module identifier; must match `FS_MODULE_ID` on the service-fs instance | | `INPUT_MODULE_ID` | Yes | — | Per-tenant module identifier; must match `FS_MODULE_ID` on the service-fs instance |
| `INPUT_FS_URL` | Yes | — | service-fs base URL | | `INPUT_FS_URL` | Yes | — | service-fs base URL |
| `INPUT_BIND_ADDR` | No | `0.0.0.0:9200` | Address and port the MCP HTTP server binds to | | `INPUT_BIND_ADDR` | No | `0.0.0.0:9200` | Address and port the MCP HTTP server binds to |
## Relations to other Ring 1 services ## Relations to other Ring 1 services
`service-input` is one of four Ring 1 boundary-ingest services. Each addresses a distinct document intake channel: `service-input` handles generic documents (PDF, DOCX, XLSX, Markdown); [[service-email]] handles Microsoft Exchange mailboxes; [[service-people]] handles identity records; and `service-fs` is the WORM ledger itself that all three write through. [[service-extraction]] in Ring 2 reads from `service-fs` and applies AI-assisted analysis downstream, outside the Ring 1 boundary. `service-input` is one of four Ring 1 boundary-ingest services. Each addresses a distinct document intake channel: `service-input` handles generic documents (PDF, DOCX, XLSX, Markdown); [[service-email]] handles Microsoft Exchange mailboxes; [[service-people]] handles identity records; and `service-fs` is the WORM ledger itself that all three write through. [[service-extraction]] in Ring 2 reads from `service-fs` and applies AI-assisted analysis downstream, outside the Ring 1 boundary.
## See also ## See also
- [[fs-anchor-emitter]] — the WORM ledger anchoring service - [[fs-anchor-emitter]] — the WORM ledger anchoring service
- [[three-ring-architecture]] — Ring 1 boundary placement and service-input's role - [[three-ring-architecture]] — Ring 1 boundary placement and service-input's role
- [[input-machine]] — the F12 client-side interface to service-input - [[input-machine]] — the F12 client-side interface to service-input