How search works

Start with a search that asks for enough context to inspect where each result came from:

curl -s https://alpha.api.trycaesar.com/v1/search \
  -H "Authorization: Bearer $CAESAR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "postgres 17 logical replication failover",
    "max_results": 3,
    "response": { "verbosity": "full" }
  }'

The response is not just a list of URLs. Each result carries a doc_id, canonical URL, source URL, timestamps, passages, and capture provenance that you can reuse in later document reads, provenance checks, and feedback calls.
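For agents that call the API from code rather than a shell, the same request can be sketched in Python. This is a minimal sketch that only mirrors the curl example above (URL, headers, and body fields); the helper name `build_search_request` is hypothetical, and the request is built but not sent.

```python
import json
import os
import urllib.request

SEARCH_URL = "https://alpha.api.trycaesar.com/v1/search"  # taken from the curl example


def build_search_request(query: str, api_key: str, max_results: int = 3,
                         verbosity: str = "full") -> urllib.request.Request:
    """Build (but do not send) the same POST request as the curl example."""
    body = {
        "query": query,
        "max_results": max_results,
        "response": {"verbosity": verbosity},
    }
    return urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_search_request(
    "postgres 17 logical replication failover",
    api_key=os.environ.get("CAESAR_API_KEY", "test-key"),  # placeholder fallback key
)
# To send it: urllib.request.urlopen(request), then json.load() the response body.
```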

The contract

Caesar is built around canonical documents, not disposable snippets. The search infrastructure discovers public pages, turns them into durable document records, extracts passages, ranks candidates, and returns identifiers that agents can reuse:

| Identifier | Why it matters |
| --- | --- |
| `doc_id` | Stable document identity. Use it to read the same canonical document later. |
| `passage_id` | Specific passage identity within a captured document. Use it for citation-grade evidence. |
| `capture_id` | The exact capture that produced the content. Use it to detect drift. |
| `search_id` | The ranked list that produced the result. Use it when sending feedback. |
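One way an agent might carry these four identifiers around is a small immutable record. The sketch below is illustrative only: the response layout it reads (a top-level `search_id`, a `results` list, and a `passages` list per result) is a hypothetical shape, and the sample ID values are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvidenceRef:
    """The identifiers worth persisting for one piece of evidence."""
    search_id: str
    doc_id: str
    capture_id: Optional[str] = None   # assumed to be absent at lower verbosity
    passage_id: Optional[str] = None   # set when citing a specific passage


def evidence_refs(response: dict) -> list[EvidenceRef]:
    """Flatten a search response into one EvidenceRef per passage.

    Results without passages still yield one document-level reference.
    """
    refs = []
    for result in response.get("results", []):
        passages = result.get("passages") or [{}]
        for passage in passages:
            refs.append(EvidenceRef(
                search_id=response["search_id"],
                doc_id=result["doc_id"],
                capture_id=result.get("capture_id"),
                passage_id=passage.get("passage_id"),
            ))
    return refs


# Made-up sample in the hypothetical response shape.
sample = {
    "search_id": "srch_123",
    "results": [
        {"doc_id": "doc_a", "capture_id": "cap_1",
         "passages": [{"passage_id": "psg_1"}, {"passage_id": "psg_2"}]},
        {"doc_id": "doc_b"},  # a result returned without passages
    ],
}
refs = evidence_refs(sample)
```

Freezing the dataclass makes each reference hashable, so references can be deduplicated in a set or used as cache keys.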

Pipeline

Discover and fetch

The system finds public web documents from crawl queues, links, sitemaps, query demand, recrawl schedules, and source-quality signals. Fetching is polite and policy-aware: removals, robots controls, abuse controls, and suppression rules are part of the serving contract.

Canonicalize

Multiple URLs can point at the same underlying document. Caesar normalizes them into a canonical record, preserves the observed source URL, and tracks duplicate or alternate paths without making the agent reason about URL clutter.
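To make the idea concrete, here is a generic URL-normalization sketch. It is not Caesar's actual canonicalization logic, which this page does not describe; the rules chosen here (lowercase host, drop fragments and default ports, strip `utm_` parameters, sort the query string, trim trailing slashes) are illustrative assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)  # illustrative choice, not a documented rule
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Reduce a URL to a comparison key using simple, generic rules."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"      # keep only non-default ports
    path = parts.path.rstrip("/") or "/"    # treat /docs and /docs/ alike
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    )
    return urlunsplit((scheme, host, path, urlencode(query), ""))  # fragment dropped


def group_by_canonical(observed_urls: list[str]) -> dict[str, list[str]]:
    """Map each normalized key to the source URLs that were observed for it."""
    groups: dict[str, list[str]] = {}
    for url in observed_urls:
        groups.setdefault(normalize_url(url), []).append(url)
    return groups


groups = group_by_canonical([
    "HTTPS://Example.com/docs/?utm_source=x&b=2&a=1#top",
    "https://example.com/docs?a=1&b=2",
])
```

Keeping the observed URLs alongside the normalized key mirrors the text's point that the source URL is preserved rather than discarded.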

Extract and segment

Fetched content is converted into clean document text, metadata, and passages. Passages keep section context and offsets so an agent can quote the right piece without rereading the whole page.
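The value of offsets plus section context can be shown with a toy segmenter. This is a sketch under stated assumptions, not the extraction pipeline itself: it splits on blank lines and treats lines starting with `#` as headings, both of which are conventions invented for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    section: str  # nearest heading above the passage ("" if none)
    start: int    # character offset into the document text, inclusive
    end: int      # character offset into the document text, exclusive


def segment(text: str) -> list[Passage]:
    """Split text into blank-line-separated passages, tracking headings."""
    passages = []
    section = ""
    # Each match is a run of consecutive non-empty lines.
    for match in re.finditer(r"[^\n]+(?:\n[^\n]+)*", text):
        block = match.group()
        if block.startswith("#"):
            section = block.lstrip("# ").strip()  # heading updates the context
            continue
        passages.append(Passage(section, match.start(), match.end()))
    return passages


text = (
    "# Failover\n\n"
    "Promote the standby.\nThen repoint clients.\n\n"
    "# Slots\n\n"
    "Slots must be synced first.\n"
)
passages = segment(text)
# Quoting a passage is a slice of the stored text; no re-parsing is needed.
quote = text[passages[1].start:passages[1].end]
```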

Index and rank

The serving index uses lexical, structural, freshness, source, and quality signals to retrieve candidates. Before ranking, query understanding normalizes the query, fixes obvious spelling errors, recognizes known-item style queries, and applies source constraints.

Return evidence

The API returns ranked results plus the IDs and metadata needed to read, cite, and continue. The response verbosity controls how much comes back: compact protects context windows, standard adds passages, and full adds capture provenance.

Coverage

Caesar targets the broad public web: technical docs, blogs, forums, PDFs, product pages, news pages, reference material, and long-tail research pages. It is not a vertical-only index and does not expose different schemas for different content categories.

The public docs do not publish a raw page-count number. For agents, raw URL counts are a weak proxy for usefulness; the useful unit is a fresh canonical document with readable content, passage evidence, and provenance.

Use result metadata such as canonical_url, source_url, last_seen_at, last_crawled_at, published_at, capture_time, and content_digest to judge whether a result is fresh enough for your task.
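A freshness check over that metadata might look like the sketch below. The field names come from the text above, but the wire format is an assumption: the code expects ISO 8601 strings and treats a missing timezone as UTC. The helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Fields from the text that describe when the content was last observed.
OBSERVED_FIELDS = ("capture_time", "last_crawled_at", "last_seen_at")


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp (assumed format), defaulting to UTC."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def newest_observation(result: dict) -> Optional[datetime]:
    """Return the most recent observation timestamp present on the result."""
    stamps = [parse_ts(result[name]) for name in OBSERVED_FIELDS if result.get(name)]
    return max(stamps) if stamps else None


def is_fresh_enough(result: dict, max_age: timedelta, now: datetime) -> bool:
    """True when the result was observed within max_age of now."""
    observed = newest_observation(result)
    return observed is not None and now - observed <= max_age


now = datetime(2025, 1, 10, tzinfo=timezone.utc)
recent = {"last_crawled_at": "2025-01-09T12:00:00Z"}
stale = {"last_crawled_at": "2024-12-01T00:00:00Z"}
```

Passing `now` explicitly keeps the check deterministic and easy to test; results with no timestamps are treated as not fresh rather than guessed at.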

Freshness

Freshness is handled at three layers:

Layer	What happens
Recrawl policy	Documents with higher change likelihood, demand, or source importance are revisited sooner.
Query policy	`freshness_policy` can bias retrieval toward recently published or recently captured material.
Response metadata	Search and read responses expose timestamps so agents can decide whether to trust, read again, or search with a stricter freshness constraint.

When freshness matters, request response.verbosity: "full" or read the document after search. That gives the agent capture-level evidence instead of relying on a snippet alone.
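With capture-level evidence in hand, an agent can compare what it stored earlier against what it sees now. The following is a hedged sketch: it assumes content_digest is a hash of the captured content and falls back to capture_id when a digest is missing on either side. The function name and the sample values are hypothetical.

```python
def has_drifted(stored: dict, current: dict) -> bool:
    """Report whether a document's content changed between two observations."""
    if stored["doc_id"] != current["doc_id"]:
        raise ValueError("compare observations of the same doc_id only")
    old_digest = stored.get("content_digest")
    new_digest = current.get("content_digest")
    if old_digest and new_digest:
        # Same digest means same content, even if it was captured again.
        return old_digest != new_digest
    # Without digests, a different capture is the best available signal.
    return stored.get("capture_id") != current.get("capture_id")


earlier = {"doc_id": "doc_a", "capture_id": "cap_1", "content_digest": "aaa"}
recaptured = {"doc_id": "doc_a", "capture_id": "cap_2", "content_digest": "aaa"}
changed = {"doc_id": "doc_a", "capture_id": "cap_3", "content_digest": "bbb"}
```

When drift is detected, previously stored quotes should be re-read from the document before they are cited again.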

Ranking

Ranking starts with candidate retrieval, then adds second-stage reranking and snippets. If reranking is unavailable, Caesar returns first-stage order with a warning instead of failing the search.

Scores are response-local. Compare scores inside one response only; do not persist them as universal relevance values. The fields to persist are doc_id, passage_id, capture_id, and search_id.
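In practice this means storing rank position and identifiers, and dropping the score. The sketch below assumes a hypothetical response shape (a `results` list in ranked order, a per-result `score`, and an optional top-level `warnings` list); none of these field layouts are confirmed by this page.

```python
def persistable_rows(response: dict) -> list[dict]:
    """Keep identifiers and rank position; deliberately drop raw scores."""
    degraded = bool(response.get("warnings"))  # e.g. first-stage order only
    rows = []
    for rank, result in enumerate(response["results"], start=1):
        rows.append({
            "search_id": response["search_id"],
            "doc_id": result["doc_id"],
            "capture_id": result.get("capture_id"),
            "passage_id": result.get("passage_id"),
            "rank": rank,          # position within this response only
            "degraded": degraded,  # lets later analysis discount the ordering
        })
    return rows


# Made-up sample in the hypothetical response shape.
response = {
    "search_id": "srch_123",
    "warnings": ["rerank_unavailable"],
    "results": [
        {"doc_id": "doc_a", "capture_id": "cap_1", "score": 0.82},
        {"doc_id": "doc_b", "capture_id": "cap_7", "score": 0.41},
    ],
}
rows = persistable_rows(response)
```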

Feedback

Feedback is optional. Send it after a result clearly helped or failed a task:

caesar-search feedback \
  --event-type result_helpful \
  --search-id "$SEARCH_ID" \
  --doc-id "$DOC_ID"

Feedback is tied to the ranked list via search_id, so it lands on the retrieval context that produced the result.
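An agent that sends feedback over HTTP instead of the CLI would need a request body. The sketch below only builds one: the JSON field names are assumptions that mirror the CLI flags above, not a documented schema, and `result_helpful` is the only event type this page shows.

```python
import json


def build_feedback_payload(event_type: str, search_id: str, doc_id: str) -> str:
    """Serialize a feedback event; field names mirror the CLI flags (assumed)."""
    fields = {"event_type": event_type, "search_id": search_id, "doc_id": doc_id}
    missing = [name for name, value in fields.items() if not value]
    if missing:
        # search_id in particular ties the event to its ranked list.
        raise ValueError(f"missing feedback fields: {', '.join(missing)}")
    return json.dumps(fields)


payload = build_feedback_payload("result_helpful", "srch_123", "doc_a")
```

Rejecting empty identifiers up front avoids sending feedback that cannot be tied back to the retrieval context that produced the result.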

What Caesar does not hide

Caesar separates retrieval from answer synthesis. /v1/search returns ranked evidence, /v1/document reads the document, and /v1/feedback records what helped. This keeps source identity, freshness, and provenance visible to the agent instead of hiding them behind a generated answer.
