Documents - Caesar

POST /v1/document inspects one canonical document and returns the selected content as clean markdown (the default format) or plain text. The usual flow is search → pick a doc_id → read, but you can also read any URL directly: unknown URLs are fetched on demand.

curl -s https://alpha.api.trycaesar.com/v1/document \
  -H "Authorization: Bearer $CAESAR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "canonical_url": "https://www.postgresql.org/docs/17/logical-replication.html",
    "query": "how does failover work"
  }'

from caesar_search import Caesar

client = Caesar()
doc = client.read(
    "https://www.postgresql.org/docs/17/logical-replication.html",
    query="how does failover work",
)
print(doc.content.text if doc.content else doc.doc.title)

import { Caesar } from "caesar-search";

const caesar = new Caesar();

// Pass a doc_id or a URL
const doc = await caesar.read("https://www.postgresql.org/docs/17/logical-replication.html", {
  query: "how does failover work",
});
console.log(doc.content?.text);

caesar-search read https://www.postgresql.org/docs/17/logical-replication.html --query "how does failover work" --json

doc_id or canonical_url

One of the two lookup keys is required (missing both is a 400 validation_error); if you send both, doc_id takes precedence and canonical_url is ignored:

doc_id — the UUID identifier from a search result. It is deterministic per canonical URL and can be reused across searches and recrawls.
canonical_url — any URL. It is normalized before lookup (tracking params stripped, fragment removed, host lowercased), and unknown URLs are fetched on demand.
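
The normalization described above can be approximated client-side, for example to deduplicate URLs before calling the API. This is a minimal sketch, not the server's implementation: the documentation does not list which query parameters count as tracking parameters, so `TRACKING_PREFIXES` and `TRACKING_NAMES` are assumptions, and `normalize_url` is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: the real tracking-parameter list is not documented here.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"gclid", "fbclid"}


def normalize_url(url: str) -> str:
    """Strip tracking params, drop the fragment, lowercase the host."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_NAMES
    ]
    # Lowercases the whole authority; assumes the URL carries no userinfo.
    netloc = parts.netloc.lower()
    # An empty final element removes the fragment.
    return urlunsplit((parts.scheme, netloc, parts.path, urlencode(kept), ""))


if __name__ == "__main__":
    print(normalize_url(
        "https://WWW.PostgreSQL.org/docs/17/logical-replication.html?utm_source=x&page=2#failover"
    ))
```

The server's result is authoritative; treat a local normalizer like this only as a convenience.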

For API-key accounts, doc_id lookups resolve only documents the account has been granted: the document appeared in one of its search results, or a prior canonical_url lookup granted it. An ungranted doc_id returns 404 document_not_found. A successful canonical_url read grants that doc_id for future calls — so URL-first, then doc_id, always works. If content cannot be fetched, the response still returns doc metadata plus warning content_unavailable.

Choosing sections: include

include selects which response sections to compute. If you omit it, the raw API returns all available sections.

Value	Returns
`metadata`	the `doc` block (`doc_id`, `canonical_url`, `source_url`, `title`, `meta_description`, `content_digest`, `published_at`, `first_seen_at`, `last_seen_at`, `latest_capture_id`, `headings`)
`content`	the `content` block (see below)
`passages`	structured `passages` for citation
`capture_history`	capture metadata (see below)

Any other value is a 400 validation_error. include: ["metadata"] alone skips the content fetch entirely, which makes it the cheapest way to check freshness fields. provenance is always returned when a capture exists, regardless of include. This differs from search results, where provenance is verbosity-gated (see provenance).

The official SDKs and CLI default to include: ["metadata", "content"]; the raw API defaults to all sections. Ask for passages or capture_history explicitly when using a client.
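
When calling the raw API without an SDK, the lookup-key and include rules above can be checked before the request is sent. The following is a minimal sketch: `build_document_request` is a hypothetical helper, not part of any official client, and it only assembles the JSON body using the field names documented on this page.

```python
from typing import Optional, Sequence

# The four include values documented in the table above.
ALLOWED_INCLUDE = {"metadata", "content", "passages", "capture_history"}


def build_document_request(
    doc_id: Optional[str] = None,
    canonical_url: Optional[str] = None,
    include: Optional[Sequence[str]] = None,
    query: Optional[str] = None,
) -> dict:
    """Assemble a POST /v1/document body, failing early on invalid input."""
    if not doc_id and not canonical_url:
        raise ValueError("one of doc_id or canonical_url is required")
    body: dict = {}
    # The server prefers doc_id when both are sent, so send only that key.
    if doc_id:
        body["doc_id"] = doc_id
    else:
        body["canonical_url"] = canonical_url
    if include is not None:
        unknown = sorted(set(include) - ALLOWED_INCLUDE)
        if unknown:
            raise ValueError(f"unknown include values: {unknown}")
        body["include"] = list(include)
    if query:
        body["query"] = query
    return body


if __name__ == "__main__":
    # Metadata-only read: the cheapest way to check freshness fields.
    print(build_document_request(
        canonical_url="https://www.postgresql.org/docs/17/logical-replication.html",
        include=["metadata"],
    ))
```

Leaving `include` unset keeps the raw API default of all sections; pass it explicitly when you want only a subset.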

Content controls

Field	Values / default	Notes
`content.selection`	`none`, `query_relevant`, `top_passages`, `passage_ids`, `full_document` (default)	clients default to `full_document`; use `query_relevant` explicitly when you want excerpts focused by `query`; `none` omits the content block; `full_document` forces a full fetch when only a snippet-derived capture exists
`content.format`	`text`, `markdown` (default)
`content.max_chars`	integer, min 1, optional	omit for the full selected content; capped reads that exceed the limit set `content.truncated: true` and warning `content_truncated` — the request never fails for size
`content.include_offsets`	boolean	passage `char_start`/`char_end` are included only when true and the capture is a full-document capture

The response content object contains selection, format, text, truncated, and char_count, plus start_char on range reads (start_char is omitted when it is 0).

Continuation reads

A truncated read is resumable. Do not retry with a bigger cap; continue from start_char + char_count using content.range.

curl -s https://alpha.api.trycaesar.com/v1/document \
  -H "Authorization: Bearer $CAESAR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "canonical_url": "https://www.postgresql.org/docs/17/logical-replication.html",
    "content": { "selection": "full_document", "range": { "start_char": 12000 } }
  }'

from caesar_search import Caesar

client = Caesar()
url = "https://www.postgresql.org/docs/17/logical-replication.html"

doc = client.read(url)
while doc.content and doc.content.truncated:
    next_offset = (doc.content.start_char or 0) + doc.content.char_count
    doc = client.read(url, start_char=next_offset)

import { Caesar } from "caesar-search";

const caesar = new Caesar();
const url = "https://www.postgresql.org/docs/17/logical-replication.html";

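let doc = await caesar.read(url);
// Keep reading until the response is no longer truncated.
while (doc.content?.truncated) {
  // Next offset = previous start_char (omitted when 0) + previous char_count
  const nextOffset = (doc.content?.start_char ?? 0) + (doc.content?.char_count ?? 0);
  doc = await caesar.read(url, { startChar: nextOffset });
}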

caesar-search read https://www.postgresql.org/docs/17/logical-replication.html --max-chars 12000 --start-char 12000 --json

The content.range fields:

Field	Meaning
`range.start_char`	character offset to start from (min 0)
`range.max_chars`	cap for this range; overrides `content.max_chars`
`range.capture_id`	optional capture pin; warns `stale_range` when the latest capture differs

Rules that make the loop safe:

The next offset is always start_char + char_count from the previous response (start_char is omitted when 0). The CLI flag is --start-char; the SDKs take startChar (TypeScript) / start_char (Python).
Continuation reads must use selection: "full_document" — offsets address the raw document text, and selection-based excerpting would make them non-contiguous between calls. The raw API does not change your selection; set it yourself on range reads (as in the cURL example above). The official clients do this automatically whenever a start offset is set.
A start_char past the end of the document returns HTTP 200 with empty text and char_count: 0 — your loop terminates cleanly.
stale_range: emitted only when range.start_char is greater than 0 and range.capture_id is set but no longer matches the latest capture. The read still succeeds and is served from the latest capture, but offsets may not line up with your earlier chunks — restart from start_char: 0 if exact continuity matters. Details carry requested_capture_id and latest_capture_id.
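
The offset rules above can be exercised without network access. This is a sketch under stated assumptions: `fake_read` is an in-memory stand-in that mimics only the documented content fields (text, char_count, truncated, start_char), and `read_all` is a hypothetical helper that reassembles the full text. Neither is part of the official clients.

```python
from typing import Callable, Optional


def fake_read(document: str, start_char: int = 0, max_chars: Optional[int] = None) -> dict:
    """Stand-in for a range read; returns a content-like dict."""
    end = len(document) if max_chars is None else min(len(document), start_char + max_chars)
    text = document[start_char:end]
    content = {"text": text, "char_count": len(text), "truncated": end < len(document)}
    if start_char > 0:
        content["start_char"] = start_char  # omitted when 0, as documented
    return content


def read_all(read_chunk: Callable[[int], dict]) -> str:
    """Follow start_char + char_count until the read is no longer truncated."""
    content = read_chunk(0)
    chunks = [content["text"]]
    # The char_count guard is a defensive addition so an empty chunk cannot loop forever.
    while content["truncated"] and content["char_count"] > 0:
        next_offset = content.get("start_char", 0) + content["char_count"]
        content = read_chunk(next_offset)
        chunks.append(content["text"])
    return "".join(chunks)


if __name__ == "__main__":
    document = "abcdefghij" * 5
    rebuilt = read_all(lambda offset: fake_read(document, offset, max_chars=12))
    print(rebuilt == document)
```

In real use, `read_chunk` would wrap a range read with selection set to full_document, so that consecutive chunks stay contiguous.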

Capture history

Each capture_history entry contains capture_id, capture_time, content_digest, and optionally content_format (text for search-snippet captures, markdown for full fetches).

Today capture_history returns the latest capture entry only. It is metadata for change detection — compare content_digest across reads — not a surface for reading old captures.
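
Change detection with content_digest might look like the sketch below. It assumes the parsed response exposes capture_history as a list of entries under a `capture_history` key; that shape and the `digest_changed` helper are assumptions for illustration, so check the API reference for the exact schema.

```python
from typing import Optional


def latest_digest(response: dict) -> Optional[str]:
    """Return content_digest from the capture_history entry, if present."""
    # Assumption: capture_history is a list; today it holds only the latest capture.
    entries = response.get("capture_history") or []
    return entries[0].get("content_digest") if entries else None


def digest_changed(previous_digest: Optional[str], response: dict) -> bool:
    """True only when both digests are known and they differ."""
    current = latest_digest(response)
    return previous_digest is not None and current is not None and current != previous_digest


if __name__ == "__main__":
    earlier = {"capture_history": [{"capture_id": "c1", "content_digest": "aaa"}]}
    later = {"capture_history": [{"capture_id": "c2", "content_digest": "bbb"}]}
    print(digest_changed(latest_digest(earlier), later))
```

Store the digest alongside whatever you derived from the document, and re-read the content only when the comparison reports a change.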

Content freshness is managed server-side: a document is re-fetched when no capture exists, the capture has no text, the capture TTL expired (search-derived captures roughly 15 minutes, full-document captures roughly 24 hours), or full_document was requested against a snippet-only capture. The provenance block (capture_id, capture_time) always identifies the capture the returned content came from.
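
The four re-fetch conditions above can be written as a single predicate. This is purely an illustration of the documented rule, not the server's code and not something clients need to run: the `capture` dict and its `kind` and `text` keys are hypothetical, and the TTL constants use the approximate values quoted above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Approximate TTLs, as described above ("roughly").
SNIPPET_TTL = timedelta(minutes=15)
FULL_TTL = timedelta(hours=24)


def needs_refetch(capture: Optional[dict], now: datetime, full_document_requested: bool) -> bool:
    """Return True when any documented re-fetch condition holds."""
    if capture is None:            # no capture exists
        return True
    if not capture.get("text"):    # the capture has no text
        return True
    is_snippet = capture["kind"] == "snippet"  # hypothetical field
    ttl = SNIPPET_TTL if is_snippet else FULL_TTL
    if now - capture["capture_time"] > ttl:    # the capture TTL expired
        return True
    # full_document requested against a snippet-only capture
    return full_document_requested and is_snippet


if __name__ == "__main__":
    now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
    snippet = {"kind": "snippet", "text": "x", "capture_time": now - timedelta(minutes=5)}
    print(needs_refetch(snippet, now, full_document_requested=True))
```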

Passages

passages are structured spans for citation: passage_id, doc_id, ordinal, text, plus optional section_heading, section_path, and char_start/char_end (offsets only with content.include_offsets: true on a full-document capture).

To re-fetch specific passages from a search result, use content.selection: "passage_ids" with content.passage_ids. Passage IDs identify passages in the latest capture, so they change when the document content changes. Stale or unknown IDs never fail the request: available passages are returned plus warning stale_passage_id listing the missing IDs in details.passage_ids. Treat that warning as “the document changed” and re-search or re-read rather than retrying the same IDs.

Full request and response schemas are in the API reference; error envelopes and codes in errors.
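
Handling the stale_passage_id warning could look like the following sketch. The warning envelope is an assumption: it supposes the parsed response has a top-level `warnings` list whose items carry `code` and `details`, with the missing IDs in `details.passage_ids` as described above. `missing_passage_ids` is a hypothetical helper name.

```python
from typing import List


def missing_passage_ids(response: dict) -> List[str]:
    """Collect passage IDs reported as stale or unknown."""
    missing: List[str] = []
    # Assumption: warnings is a list of {"code": ..., "details": {...}} objects.
    for warning in response.get("warnings", []):
        if warning.get("code") == "stale_passage_id":
            missing.extend(warning.get("details", {}).get("passage_ids", []))
    return missing


if __name__ == "__main__":
    response = {
        "passages": [{"passage_id": "p1", "text": "..."}],
        "warnings": [{"code": "stale_passage_id", "details": {"passage_ids": ["p2"]}}],
    }
    stale = missing_passage_ids(response)
    if stale:
        # The document changed: re-search or re-read instead of retrying these IDs.
        print("stale passage ids:", stale)
```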
