Start with a search that asks for enough context to inspect where each result came from:
curl -s https://search-api-staging-779189860552.europe-west1.run.app/v1/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "postgres 17 logical replication failover",
    "max_results": 3,
    "response": { "verbosity": "full" }
  }'
The response is not just a list of URLs. Each result carries a doc_id, a canonical URL, a source URL, timestamps, passages, and capture provenance that you can pass on to the document, provenance, and feedback calls.
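
As a rough illustration of inspecting where each result came from, the sketch below walks a response and summarizes each result. The field names come from the text above, but the nesting under "results" and "passages" and all sample values are assumptions made for this example.

```python
# Hypothetical response shape: field names follow the text above, but the
# nesting ("results", "passages") and every value here are made up.
SAMPLE_RESPONSE = {
    "search_id": "s_123",
    "results": [
        {
            "doc_id": "d_1",
            "canonical_url": "https://example.com/docs/replication",
            "source_url": "https://example.com/docs/replication?ref=home",
            "capture_id": "c_9",
            "passages": [{"passage_id": "p_1", "text": "placeholder passage"}],
        }
    ],
}


def describe_results(response: dict) -> list[str]:
    """Return one human-readable provenance line per search result."""
    lines = []
    for result in response.get("results", []):
        passages = result.get("passages", [])
        lines.append(
            f'{result["doc_id"]}: {result["canonical_url"]} '
            f'(observed at {result["source_url"]}, {len(passages)} passage(s))'
        )
    return lines


for line in describe_results(SAMPLE_RESPONSE):
    print(line)
```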

The contract

Caesar is built around canonical documents, not disposable snippets. The search infrastructure discovers public pages, turns them into durable document records, extracts passages, ranks candidates, and returns handles that agents can reuse:

| Handle | Why it matters |
| --- | --- |
| doc_id | Stable document identity. Use it to read the same canonical document later. |
| passage_id | Specific passage identity within a captured document. Use it for citation-grade evidence. |
| capture_id | The exact capture that produced the content. Use it to detect drift. |
| search_id | The ranked list that produced the result. Use it when sending feedback. |
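
One way an agent might use capture_id to detect drift is to store it alongside a citation and compare it with the capture returned by a later read. This is a minimal sketch under that assumption; the Citation class and has_drifted helper are hypothetical names, not part of the API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Handles saved at citation time (field names from the table above)."""
    doc_id: str
    passage_id: str
    capture_id: str


def has_drifted(saved: Citation, current_capture_id: str) -> bool:
    """True when the document was recaptured after the citation was saved.

    A new capture does not prove the cited passage changed; it only signals
    that the passage should be re-read before it is quoted again.
    """
    return saved.capture_id != current_capture_id


# Placeholder identifiers, not real handles.
saved = Citation(doc_id="d_1", passage_id="p_1", capture_id="c_9")
print(has_drifted(saved, "c_9"))
print(has_drifted(saved, "c_10"))
```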

Pipeline

1. Discover and fetch

The system finds public web documents from crawl queues, links, sitemaps, query demand, recrawl schedules, and source-quality signals. Fetch policy is polite and policy-aware: removals, robots controls, abuse controls, and suppression rules are part of the serving contract.
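
To make the robots-control part of this step concrete, here is a small sketch using Python's standard urllib.robotparser. It only illustrates the general idea of checking a robots.txt rule before fetching; the user agent name and the rules are invented, and Caesar's actual fetch policy is not documented here.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_to_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against already-downloaded robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "example-bot" is a hypothetical crawler name.
print(allowed_to_fetch(ROBOTS_TXT, "example-bot", "https://example.com/docs/"))
print(allowed_to_fetch(ROBOTS_TXT, "example-bot", "https://example.com/private/x"))
```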

2. Canonicalize

Multiple URLs can point at the same underlying document. Caesar normalizes them into a canonical record, preserves the observed source URL, and tracks duplicate or alternate paths without making the agent reason about URL clutter.
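
The idea of folding several observed URLs into one canonical record can be sketched with simple URL normalization. The rules below (lowercase host, drop the fragment, drop a hypothetical list of tracking parameters, sort the rest, trim the trailing slash) are assumptions chosen for illustration, not Caesar's actual canonicalization rules.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of tracking parameters to ignore.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref"}


def canonicalize(url: str) -> str:
    """Normalize a URL so that trivially different variants compare equal."""
    parts = urlsplit(url)
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    )
    path = parts.path.rstrip("/") or "/"
    # The fragment is dropped by passing an empty string as the last component.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def group_by_canonical(source_urls: list[str]) -> dict[str, list[str]]:
    """Map each canonical URL to the observed source URLs that led to it."""
    groups: dict[str, list[str]] = {}
    for url in source_urls:
        groups.setdefault(canonicalize(url), []).append(url)
    return groups


observed = ["https://example.com/docs/?ref=home", "https://EXAMPLE.com/docs#intro"]
print(group_by_canonical(observed))
```

Keeping the observed URLs in the group preserves the source URL next to the canonical one, which mirrors the distinction the step describes.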

3. Extract and segment

Fetched content is converted into clean document text, metadata, and passages. Passages keep section context and offsets so an agent can quote the right piece without rereading the whole page.
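
A toy version of passage segmentation shows what "section context and offsets" can look like in practice. This sketch assumes the cleaned text separates blocks with blank lines and marks headings with a leading "# "; both conventions are assumptions for the example, not a description of Caesar's extractor.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    section: str  # nearest preceding heading, or "" before the first heading
    start: int    # character offset of the passage in the document text
    end: int      # exclusive end offset
    text: str


def segment(text: str) -> list[Passage]:
    """Split on blank lines and remember each passage's section and offsets."""
    passages: list[Passage] = []
    section = ""
    position = 0
    for block in text.split("\n\n"):
        block_start = position
        position += len(block) + 2  # skip the block and its "\n\n" separator
        stripped = block.strip()
        if not stripped:
            continue
        if stripped.startswith("# "):
            section = stripped[2:]
            continue
        start = block_start + block.index(stripped)
        passages.append(Passage(section, start, start + len(stripped), stripped))
    return passages


document = "# Setup\n\nInstall the package.\n\n# Usage\n\nRun the tool."
for passage in segment(document):
    print(passage.section, passage.start, passage.end)
```

Because every passage carries offsets into the document text, a quote can be checked with document[start:end] without rereading the whole page.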

4. Index and rank

The serving index uses lexical, structural, freshness, source, and quality signals to retrieve candidates. Query understanding handles normalization, obvious spelling issues, known-item style queries, and source constraints before ranking.
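
The text does not say how the signals are combined. Purely as an illustration of multi-signal ranking, the sketch below blends per-candidate signal values with a weighted sum; the weights, the 0 to 1 signal scale, and the candidate shape are all invented for this example.

```python
# Hypothetical weights; the signal names come from the text above.
WEIGHTS = {
    "lexical": 0.4,
    "structural": 0.1,
    "freshness": 0.2,
    "source": 0.15,
    "quality": 0.15,
}


def blend(signals: dict[str, float]) -> float:
    """Weighted sum of signal values; missing signals count as 0."""
    return sum(weight * signals.get(name, 0.0) for name, weight in WEIGHTS.items())


def rank(candidates: list[dict]) -> list[dict]:
    """Order candidates from highest to lowest blended score."""
    return sorted(candidates, key=lambda c: blend(c["signals"]), reverse=True)


candidates = [
    {"doc_id": "d_a", "signals": {"lexical": 0.9, "freshness": 0.1}},
    {"doc_id": "d_b", "signals": {"lexical": 0.5, "freshness": 1.0, "quality": 1.0}},
]
print([c["doc_id"] for c in rank(candidates)])
```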

5. Return evidence

The API returns ranked results plus the handles needed to read, cite, continue, and give feedback. The response verbosity setting controls how much comes back: compact protects context windows, standard adds passages, and full adds capture provenance.
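
A small helper can keep the verbosity choice explicit when building requests. The payload shape mirrors the curl example at the top of this page; the function name and the default of "standard" are assumptions made for this sketch.

```python
import json

# The three verbosity levels named in the text above.
VERBOSITY_LEVELS = ("compact", "standard", "full")


def build_search_payload(query: str, max_results: int = 3,
                         verbosity: str = "standard") -> dict:
    """Build a search request body, rejecting unknown verbosity levels."""
    if verbosity not in VERBOSITY_LEVELS:
        raise ValueError(f"verbosity must be one of {VERBOSITY_LEVELS}")
    return {
        "query": query,
        "max_results": max_results,
        "response": {"verbosity": verbosity},
    }


payload = build_search_payload("postgres 17 logical replication failover",
                               verbosity="full")
print(json.dumps(payload, indent=2))
```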

Coverage

Caesar targets the broad public web: technical docs, blogs, forums, PDFs, product pages, news pages, reference material, and long-tail research pages. It is not a vertical-only index and does not expose different schemas for different content categories.

The public docs do not publish a raw page-count number. For agents, raw URL counts are a weak proxy for usefulness; the useful unit is a fresh canonical document with readable content, passage evidence, and provenance.

Use result metadata such as canonical_url, source_url, last_seen_at, last_crawled_at, published_at, capture_time, and content_digest to judge whether a result is current enough for your task.
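
Judging whether a result is current enough can be as simple as comparing its newest capture-related timestamp with a task-specific age limit. The sketch assumes the timestamps are ISO 8601 strings with an explicit UTC offset; the format is not stated in the text above, so treat it as an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Metadata fields from the text above that describe when content was observed.
OBSERVED_FIELDS = ("capture_time", "last_crawled_at", "last_seen_at")


def newest_observation(result: dict) -> Optional[datetime]:
    """Return the most recent observation timestamp present on a result."""
    stamps = [
        datetime.fromisoformat(result[name])
        for name in OBSERVED_FIELDS
        if result.get(name)
    ]
    return max(stamps) if stamps else None


def is_current(result: dict, max_age: timedelta, now: datetime) -> bool:
    """True when the result was observed within max_age of now."""
    newest = newest_observation(result)
    return newest is not None and now - newest <= max_age


# Made-up timestamps for illustration.
sample = {
    "capture_time": "2025-01-10T00:00:00+00:00",
    "last_crawled_at": "2025-01-12T00:00:00+00:00",
}
now = datetime(2025, 1, 15, tzinfo=timezone.utc)
print(is_current(sample, timedelta(days=7), now))
```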

Freshness

Freshness is handled at three layers:

| Layer | What happens |
| --- | --- |
| Recrawl policy | Documents with higher change likelihood, demand, or source importance are revisited sooner. |
| Query policy | freshness_policy can bias retrieval toward recently published or recently captured material. |
| Response metadata | Search and read responses expose timestamps so agents can decide whether to trust, read again, or search with a stricter freshness constraint. |

When freshness matters, request response.verbosity: "full" or read the document after search. That gives the agent capture-level evidence instead of relying on a snippet alone.

Ranking

Ranking starts with candidate retrieval, then adds mode-dependent refinement:

| Mode | Use it when | Behavior |
| --- | --- | --- |
| fast | You need a quick lookup | First-stage ordering, minimal enrichment |
| standard | Default agent search | Candidate retrieval plus second-stage ranking and snippets |
| research | You need deeper evidence | More context enrichment for stronger passages |

Scores are response-local. Compare scores inside one response only; do not persist them as universal relevance values. The durable things to persist are doc_id, passage_id, capture_id, and search_id.
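
In code, this advice amounts to using scores only to order results within one response and leaving them out of anything written to storage. The sketch assumes each result exposes its score under a "score" key, which is a guess at the field name, and uses placeholder identifiers.

```python
def best_in_response(results: list[dict]) -> dict:
    """Pick the top result; scores are only compared within this one list."""
    return max(results, key=lambda result: result["score"])


def durable_record(search_id: str, result: dict) -> dict:
    """Keep the durable handles and deliberately drop the response-local score."""
    return {
        "search_id": search_id,
        "doc_id": result["doc_id"],
        "capture_id": result["capture_id"],
        "passage_ids": [p["passage_id"] for p in result.get("passages", [])],
    }


# Made-up results; "score" is an assumed field name.
results = [
    {"doc_id": "d_1", "capture_id": "c_9", "score": 0.41,
     "passages": [{"passage_id": "p_1"}]},
    {"doc_id": "d_2", "capture_id": "c_3", "score": 0.83,
     "passages": [{"passage_id": "p_7"}, {"passage_id": "p_8"}]},
]
record = durable_record("s_123", best_in_response(results))
print(record)
```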

Feedback loop

Agents can close the loop after using a result:
caesar-search feedback \
  --event-type result_helpful \
  --search-id "$SEARCH_ID" \
  --doc-id "$DOC_ID"
Feedback is tied to the exact ranked list via search_id, so the system can learn from the retrieval context that actually produced the result.
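
For agents that call the HTTP API directly instead of the CLI, a feedback body might be assembled as below. The JSON field names are assumed to mirror the CLI flags shown above; the actual request schema for /v1/feedback is not given in this page.

```python
import json


def build_feedback(event_type: str, search_id: str, doc_id: str) -> str:
    """Serialize a feedback event tied to the search that produced the result."""
    if not search_id or not doc_id:
        raise ValueError("feedback needs both a search_id and a doc_id")
    body = {
        "event_type": event_type,  # assumed to mirror --event-type
        "search_id": search_id,    # assumed to mirror --search-id
        "doc_id": doc_id,          # assumed to mirror --doc-id
    }
    return json.dumps(body, sort_keys=True)


# Placeholder identifiers, not real handles.
print(build_feedback("result_helpful", "s_123", "d_1"))
```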

What Caesar does not hide

Caesar separates retrieval from answer synthesis. /v1/search returns ranked evidence, /v1/document reads the document, and /v1/feedback records what helped. This keeps source identity, freshness, and provenance visible to the agent instead of hiding them behind a generated answer.
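
The separation of the three endpoints can be sketched as one small agent loop. The endpoint paths come from the paragraph above; the request and response shapes, the use of a single post-style transport for every call, and the fake transport standing in for the network are all assumptions for illustration.

```python
from typing import Callable

# A transport takes an endpoint path and a request body and returns parsed JSON.
Post = Callable[[str, dict], dict]


def search_read_feedback(post: Post, query: str) -> dict:
    """Search, read the top document, then report that the result helped."""
    found = post("/v1/search", {"query": query, "max_results": 1})
    top = found["results"][0]
    document = post("/v1/document", {"doc_id": top["doc_id"]})
    post("/v1/feedback", {
        "event_type": "result_helpful",
        "search_id": found["search_id"],
        "doc_id": top["doc_id"],
    })
    return document


calls: list[str] = []


def fake_post(path: str, body: dict) -> dict:
    """Stand-in transport that returns canned, made-up responses."""
    calls.append(path)
    if path == "/v1/search":
        return {"search_id": "s_123", "results": [{"doc_id": "d_1"}]}
    if path == "/v1/document":
        return {"doc_id": body["doc_id"], "text": "placeholder document text"}
    return {}


document = search_read_feedback(fake_post, "postgres 17 logical replication failover")
print(calls)
```

Passing the transport in as a function keeps the flow testable without network access; a real client would replace fake_post with an HTTP call.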