
Search returns a doc_id, canonical URL, source URL, timestamps, passages, and capture provenance that you can pass to documents, provenance, and feedback.
The contract
Caesar is built around canonical documents, not disposable snippets. The search infrastructure discovers public pages, turns them into durable document records, extracts passages, ranks candidates, and returns handles that agents can reuse:

| Handle | Why it matters |
|---|---|
| doc_id | Stable document identity. Use it to read the same canonical document later. |
| passage_id | Specific passage identity within a captured document. Use it for citation-grade evidence. |
| capture_id | The exact capture that produced the content. Use it to detect drift. |
| search_id | The ranked list that produced the result. Use it when sending feedback. |
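
As a rough illustration of how an agent might hold on to these handles, the sketch below stores them in a small record and compares capture identifiers to flag possible drift. The flat result shape and the helper names (`ResultHandles`, `handles_from_result`, `has_drifted`) are assumptions made for this example, not part of the documented API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResultHandles:
    """Handles an agent keeps after a search (names follow the table above)."""
    doc_id: str
    search_id: str
    capture_id: Optional[str] = None
    passage_id: Optional[str] = None


def handles_from_result(result: dict, search_id: str) -> ResultHandles:
    # Assumption: each result is a flat JSON object keyed by handle name.
    return ResultHandles(
        doc_id=result["doc_id"],
        search_id=search_id,
        capture_id=result.get("capture_id"),
        passage_id=result.get("passage_id"),
    )


def has_drifted(saved: ResultHandles, current_capture_id: str) -> bool:
    # A different capture_id for the same doc_id means the document was
    # captured again, so previously quoted content may have changed.
    return saved.capture_id is not None and saved.capture_id != current_capture_id
```

An agent could persist `ResultHandles` next to any quoted evidence and call `has_drifted` before reusing a citation.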
Pipeline
Discover and fetch
The system finds public web documents from crawl queues, links, sitemaps, query demand, recrawl schedules, and source-quality signals. Fetching is polite and policy-aware: removals, robots controls, abuse controls, and suppression rules are part of the serving contract.
Canonicalize
Multiple URLs can point at the same underlying document. Caesar normalizes them into a canonical record, preserves the observed source URL, and tracks duplicate or alternate paths without making the agent reason about URL clutter.
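
To make the idea of collapsing URL variants concrete, here is a minimal sketch of one possible normalization using Python's standard `urllib.parse`. The specific rules (lowercasing, dropping default ports, fragments, trailing slashes, and `utm_` parameters) are illustrative assumptions; the actual canonicalization rules Caesar applies are not described in this document.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: illustrative tracking-parameter prefixes, not Caesar's real rules.
TRACKING_PREFIXES = ("utm_",)
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Reduce a URL to a comparable form (IPv6 hosts are out of scope here)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    # Drop trailing slashes but keep the root path.
    path = parts.path.rstrip("/") or "/"
    # Remove tracking parameters and sort the rest for a stable order.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    # The fragment is discarded because it does not identify a document.
    return urlunsplit((scheme, host, path, urlencode(sorted(params)), ""))


def group_by_canonical(urls: list[str]) -> dict[str, list[str]]:
    """Map each normalized URL to the observed source URLs that produced it."""
    groups: dict[str, list[str]] = {}
    for url in urls:
        groups.setdefault(normalize_url(url), []).append(url)
    return groups
```

Keeping the observed URLs in the grouped value mirrors the point above: the canonical record is what the agent reasons about, while the source URL is preserved rather than thrown away.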
Extract and segment
Fetched content is converted into clean document text, metadata, and passages. Passages keep section context and offsets so an agent can quote the right piece without rereading the whole page.
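
The role of offsets can be sketched as follows: if each passage records where it starts and ends in the clean document text, quoting becomes a slice rather than a reread. The `Passage` record, the blank-line segmentation rule, and the `doc_id#pN` identifier format are hypothetical choices for this example only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    passage_id: str  # hypothetical format: "<doc_id>#p<index>"
    section: str     # section context carried with the passage
    start: int       # character offset into the clean document text
    end: int         # exclusive end offset


def segment_paragraphs(doc_id: str, text: str, section: str = "") -> list[Passage]:
    """Split text on blank lines and record character offsets for each block."""
    passages: list[Passage] = []
    pos = 0
    for block in text.split("\n\n"):
        stripped = block.strip()
        if stripped:
            # Skip leading whitespace so offsets point at the first real character.
            start = pos + (len(block) - len(block.lstrip()))
            passages.append(
                Passage(f"{doc_id}#p{len(passages)}", section, start, start + len(stripped))
            )
        pos += len(block) + 2  # account for the "\n\n" separator
    return passages


def quote(text: str, passage: Passage) -> str:
    """Return the exact passage text using only its offsets."""
    return text[passage.start:passage.end]
```

Because the offsets refer to one specific capture of the text, they are only meaningful together with the capture that produced them.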
Index and rank
The serving index uses lexical, structural, freshness, source, and quality signals to retrieve candidates. Query understanding handles normalization, obvious spelling issues, known-item style queries, and source constraints before ranking.
Coverage
Caesar targets the broad public web: technical docs, blogs, forums, PDFs, product pages, news pages, reference material, and long-tail research pages. It is not a vertical-only index and does not expose different schemas for different content categories. The public docs do not publish a raw page-count number. For agents, raw URL counts are a weak proxy for usefulness; the useful unit is a fresh canonical document with readable content, passage evidence, and provenance. Use result metadata such as canonical_url, source_url, last_seen_at, last_crawled_at, published_at, capture_time, and content_digest to judge whether a result is current enough for your task.
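
One way an agent might act on this metadata is a simple age check against a task-specific budget. The sketch below assumes the timestamp fields are ISO 8601 strings and picks an arbitrary preference order among them; neither detail is confirmed by this document.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_ts(value: str) -> datetime:
    # Assumption: timestamps are ISO 8601; a trailing "Z" is treated as UTC.
    ts = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return ts if ts.tzinfo else ts.replace(tzinfo=timezone.utc)


def is_current_enough(
    result: dict, max_age: timedelta, now: Optional[datetime] = None
) -> bool:
    """Return True when the newest known observation is within max_age."""
    now = now or datetime.now(timezone.utc)
    # Assumed preference order among the metadata fields named above.
    for field in ("last_crawled_at", "capture_time", "last_seen_at"):
        value = result.get(field)
        if value:
            return now - parse_ts(value) <= max_age
    return False  # no timestamp available: treat the result as not current
```

A news-monitoring agent might pass `timedelta(hours=6)`, while a reference lookup could tolerate weeks; the threshold belongs to the task, not the index.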
Freshness
Freshness is handled at three layers:

| Layer | What happens |
|---|---|
| Recrawl policy | Documents with higher change likelihood, demand, or source importance are revisited sooner. |
| Query policy | freshness_policy can bias retrieval toward recently published or recently captured material. |
| Response metadata | Search and read responses expose timestamps so agents can decide whether to trust, read again, or search with a stricter freshness constraint. |

When freshness matters, set response.verbosity: "full" or read the document after search. That gives the agent capture-level evidence instead of relying on a snippet alone.
Ranking
Ranking starts with candidate retrieval, then adds mode-dependent refinement:

| Mode | Use it when | Behavior |
|---|---|---|
| fast | You need a quick lookup | First-stage ordering, minimal enrichment |
| standard | Default agent search | Candidate retrieval plus second-stage ranking and snippets |
| research | You need deeper evidence | More context enrichment for stronger passages |

Results in every mode carry the same handles: doc_id, passage_id, capture_id, and search_id.
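
A request builder along these lines could keep mode selection explicit in agent code. The mode names, `freshness_policy`, and `response.verbosity` appear in this document; the `query` and `mode` field names and the nested `response` object are assumptions about the request shape.

```python
from typing import Any, Optional

MODES = ("fast", "standard", "research")


def build_search_request(
    query: str,
    mode: str = "standard",
    freshness_policy: Optional[Any] = None,
    full_verbosity: bool = False,
) -> dict:
    """Assemble a search request body (hypothetical shape, see lead-in)."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    body: dict = {"query": query, "mode": mode}
    if freshness_policy is not None:
        # Passed through untouched; its accepted values are not specified here.
        body["freshness_policy"] = freshness_policy
    if full_verbosity:
        body["response"] = {"verbosity": "full"}
    return body
```

Defaulting to `standard` follows the table above, which lists it as the default agent search.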
Feedback loop
Agents can close the loop after using a result. Feedback is tied to the search_id, so the system can learn from the retrieval context that actually produced the result.
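
A feedback payload might be assembled as in the sketch below. Only `search_id`, `doc_id`, and `passage_id` come from this document; the `helpful` field and the overall payload shape are hypothetical.

```python
from typing import Optional


def build_feedback(
    search_id: str,
    doc_id: str,
    helpful: bool,
    passage_id: Optional[str] = None,
) -> dict:
    """Build a feedback body tied to the search that produced the result."""
    if not search_id:
        # Without search_id the feedback cannot be linked to a ranked list.
        raise ValueError("search_id is required to attribute feedback")
    payload: dict = {"search_id": search_id, "doc_id": doc_id, "helpful": helpful}
    if passage_id:
        payload["passage_id"] = passage_id  # optional passage-level signal
    return payload
```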
What Caesar does not hide
Caesar separates retrieval from answer synthesis. /v1/search returns ranked evidence, /v1/document reads the document, and /v1/feedback records what helped. This keeps source identity, freshness, and provenance visible to the agent instead of hiding them behind a generated answer.
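
The three-step separation can be sketched as a small flow with the HTTP transport injected by the caller. The endpoint paths come from the text above; the request and response fields (`query`, `results`, `helpful`) and the `post` and `judge` callables are assumptions for illustration.

```python
from typing import Callable

# Hypothetical transport: takes an endpoint path and a JSON body, returns JSON.
Post = Callable[[str, dict], dict]


def search_read_feedback(post: Post, query: str, judge: Callable[[dict], bool]) -> dict:
    """Search, read the top document, then report whether it helped."""
    search = post("/v1/search", {"query": query})
    results = search.get("results", [])
    if not results:
        raise LookupError(f"no results for {query!r}")
    top = results[0]
    # Read the canonical document instead of trusting the snippet alone.
    document = post("/v1/document", {"doc_id": top["doc_id"]})
    # The agent, not the search service, decides whether the evidence helped.
    post(
        "/v1/feedback",
        {
            "search_id": search["search_id"],
            "doc_id": top["doc_id"],
            "helpful": judge(document),
        },
    )
    return document
```

Injecting `post` keeps the sketch independent of any particular HTTP client and makes the call order easy to verify with a stub.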