POST /v1/document inspects one canonical document and returns selected content as clean markdown. The usual flow is search → pick a doc_id → read, but you can also read any URL directly — unknown URLs are fetched on demand.
curl -s https://search-api-staging-779189860552.europe-west1.run.app/v1/document \
  -H "Content-Type: application/json" \
  -d '{
    "canonical_url": "https://www.postgresql.org/docs/17/logical-replication.html",
    "query": "how does failover work",
    "content": { "selection": "query_relevant", "format": "markdown", "max_chars": 12000 }
  }'
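For callers who prefer Python over cURL, the same request can be sketched with the standard library alone. The helper name `build_document_call` is hypothetical, and the sketch only builds the request object. No authentication header is shown because the example above does not send one.

```python
import json
import urllib.request

# Base URL copied from the cURL example above.
BASE_URL = "https://search-api-staging-779189860552.europe-west1.run.app"


def build_document_call(payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/document request.

    Hypothetical helper for illustration; it is not part of any official SDK.
    """
    return urllib.request.Request(
        f"{BASE_URL}/v1/document",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_document_call({
    "canonical_url": "https://www.postgresql.org/docs/17/logical-replication.html",
    "query": "how does failover work",
    "content": {"selection": "query_relevant", "format": "markdown", "max_chars": 12000},
})

# To send it:
#   with urllib.request.urlopen(request) as resp:
#       body = json.load(resp)
```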
doc_id or canonical_url
One of the two lookup keys is required (missing both is a 400 validation_error); if you send both, doc_id takes precedence and canonical_url is ignored:
- doc_id: the UUID handle from a search result. It is deterministic per canonical URL, so it stays stable across searches and recrawls.
- canonical_url: any URL. It is normalized before lookup (tracking params stripped, fragment removed, host lowercased), and unknown URLs are fetched on demand.
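The normalization happens server-side, so clients never need to do it themselves. As an illustration of what the three steps mean, here is a minimal sketch. The set of tracking parameters is an assumption (the real list is not documented here), and `normalize_url` is a hypothetical name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: which query parameters count as "tracking" is not specified
# in this document; these are common examples only.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"gclid", "fbclid"}


def normalize_url(url: str) -> str:
    """Strip tracking params, drop the fragment, lowercase the host."""
    parts = urlsplit(url)

    # Keep every query parameter that does not look like a tracking param.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_NAMES
    ]

    # Lowercase only host (and port); leave any userinfo untouched.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()

    # An empty fragment argument removes the "#..." part.
    return urlunsplit((parts.scheme, netloc, parts.path, urlencode(kept), ""))
```

Under these assumptions, `https://WWW.PostgreSQL.org/docs/17/logical-replication.html?utm_source=x&page=2#failover` becomes `https://www.postgresql.org/docs/17/logical-replication.html?page=2`.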
For API-key accounts, doc_id lookups resolve only documents the account has been granted: the document appeared in one of its search results, or a prior canonical_url lookup granted it. An ungranted doc_id returns 404 document_not_found. A successful canonical_url read grants that doc_id for future calls — so URL-first, then doc_id, always works.
If content cannot be fetched, the response still returns doc metadata plus warning content_unavailable.
Choosing sections: include
include selects which response sections to compute. Omitted, you get all available sections.
| Value | Returns |
|---|---|
| metadata | the doc block (doc_id, canonical_url, source_url, title, meta_description, content_digest, published_at, first_seen_at, last_seen_at, latest_capture_id, headings) |
| content | the content block (see below) |
| passages | structured passages for citation |
| capture_history | capture metadata (see below) |
Any other value is a 400 validation_error. include: ["metadata"] alone skips the content fetch entirely, which makes it the cheapest way to check freshness fields. provenance is always returned when a capture exists, regardless of include. This differs from search results, where it is verbosity-gated (see provenance).
The official SDKs and CLI default to include: ["metadata", "content"]; the raw API defaults to all sections. Ask for passages or capture_history explicitly when using a client.
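The lookup-key and include rules can be checked client-side before a request is sent, which avoids a round trip that would end in a 400. The following is a minimal sketch of such a check. `build_document_request` is a hypothetical helper, not an SDK function.

```python
from typing import Iterable, Optional

# The four include values listed in the table above.
VALID_INCLUDE = {"metadata", "content", "passages", "capture_history"}


def build_document_request(
    doc_id: Optional[str] = None,
    canonical_url: Optional[str] = None,
    include: Optional[Iterable[str]] = None,
) -> dict:
    """Build a /v1/document payload, mirroring the documented validation."""
    if not doc_id and not canonical_url:
        # The API would answer 400 validation_error here.
        raise ValueError("doc_id or canonical_url is required")

    payload: dict = {}
    if doc_id:
        # doc_id takes precedence server-side, so canonical_url is dropped.
        payload["doc_id"] = doc_id
    else:
        payload["canonical_url"] = canonical_url

    if include is not None:
        sections = list(include)
        unknown = set(sections) - VALID_INCLUDE
        if unknown:
            raise ValueError(f"unknown include values: {sorted(unknown)}")
        payload["include"] = sections
    # Omitting include lets the raw API return all available sections.
    return payload
```

For a cheap freshness check, `build_document_request(canonical_url=url, include=["metadata"])` produces a payload that skips the content fetch.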
Content controls
| Field | Values / default | Notes |
|---|---|---|
| content.selection | none, query_relevant (default), top_passages, passage_ids, full_document | clients send query_relevant when you pass a query and full_document otherwise; none omits the content block; full_document forces a full fetch when only a snippet-derived capture exists |
| content.format | text, markdown (default) | |
| content.max_chars | integer, min 1, default 12000 | larger content is truncated with content.truncated: true and warning content_truncated; the request never fails for size |
| content.include_offsets | boolean | passage char_start/char_end are included only when true and the capture is a full-document capture |
The response content object is selection, format, text, truncated, char_count, plus start_char on range reads.
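The client-side defaulting described in the table (query_relevant with a query, full_document without) can be sketched as below. This is an illustration of the documented behavior, not the SDK source, and `content_options` is a hypothetical name.

```python
from typing import Optional


def content_options(
    query: Optional[str] = None,
    max_chars: int = 12000,
    fmt: str = "markdown",
) -> dict:
    """Build the request's content object the way the clients are described to."""
    if max_chars < 1:
        raise ValueError("content.max_chars must be at least 1")
    if fmt not in ("text", "markdown"):
        raise ValueError("content.format must be 'text' or 'markdown'")

    # With a query, ask for query-relevant excerpts; otherwise the full text.
    selection = "query_relevant" if query else "full_document"
    return {"selection": selection, "format": fmt, "max_chars": max_chars}
```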
Continuation reads
A truncated read is not a dead end, and the wrong move is retrying with a bigger cap. The response tells you exactly where to resume: continue from start_char + char_count using content.range.
curl -s https://search-api-staging-779189860552.europe-west1.run.app/v1/document \
  -H "Content-Type: application/json" \
  -d '{
    "canonical_url": "https://www.postgresql.org/docs/17/logical-replication.html",
    "content": { "selection": "full_document", "range": { "start_char": 12000 } }
  }'
The content.range fields:
| Field | Meaning |
|---|---|
| range.start_char | character offset to start from (min 0) |
| range.max_chars | cap for this range; overrides content.max_chars |
| range.capture_id | optional capture pin; warns stale_range when the latest capture differs |
Rules that make the loop safe:
- The next offset is always start_char + char_count from the previous response (start_char is omitted when 0). The CLI flag is --start-char; the SDKs take startChar (TypeScript) / start_char (Python).
- Continuation reads must use selection: "full_document", because offsets address the raw document text and selection-based excerpting would make them non-contiguous between calls. The raw API does not change your selection; set it yourself on range reads (as in the cURL example above). The official clients do this automatically whenever a start offset is set.
- A start_char past the end of the document returns HTTP 200 with empty text and char_count: 0, so your loop terminates cleanly.
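The rules above can be combined into a simple read loop against the raw API. The sketch below assumes a hypothetical `post` callable that sends a JSON body to POST /v1/document and returns the parsed JSON response. It also assumes the content block sits under a top-level `content` key.

```python
from typing import Callable


def read_full_document(
    post: Callable[[dict], dict],
    canonical_url: str,
    chunk_chars: int = 12000,
) -> str:
    """Read a document chunk by chunk using content.range.

    `post` is a hypothetical transport function supplied by the caller.
    """
    chunks = []
    start = 0
    while True:
        body = {
            "canonical_url": canonical_url,
            "content": {
                # Range reads must use full_document so offsets stay contiguous.
                "selection": "full_document",
                "range": {"start_char": start, "max_chars": chunk_chars},
            },
        }
        content = post(body)["content"]
        chunks.append(content["text"])

        # Stop on an empty chunk (offset past the end) or a complete read.
        if content["char_count"] == 0 or not content["truncated"]:
            break

        # start_char is omitted from the response when it is 0.
        start = content.get("start_char", 0) + content["char_count"]
    return "".join(chunks)
```

Keeping the transport as a parameter makes the loop easy to exercise against a stub before pointing it at the real endpoint.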
stale_range: emitted only when range.start_char is greater than 0 and range.capture_id is set but no longer matches the latest capture. The read still succeeds and is served from the latest capture, but offsets may not line up with your earlier chunks — restart from start_char: 0 if exact continuity matters. Details carry requested_capture_id and latest_capture_id.
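When exact continuity matters, a reader can check for stale_range after each chunk and decide where to resume. The sketch below assumes warnings arrive as a top-level `warnings` list of objects with `code` and `details` fields. That envelope shape is an assumption, so check the API reference before relying on it. `next_range` is a hypothetical helper.

```python
def next_range(response: dict, pinned_capture_id: str) -> dict:
    """Return the content.range to use for the next read.

    Assumed response shape: {"content": {...}, "warnings": [{"code", "details"}]}.
    """
    for warning in response.get("warnings", []):
        if warning.get("code") == "stale_range":
            # The document changed under us: restart from the beginning,
            # pinned to the capture that is now the latest.
            latest = warning["details"]["latest_capture_id"]
            return {"start_char": 0, "capture_id": latest}

    content = response["content"]
    # Normal case: continue right after the chunk just received.
    return {
        "start_char": content.get("start_char", 0) + content["char_count"],
        "capture_id": pinned_capture_id,
    }
```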
Capture history
capture_history entries are capture_id, capture_time, content_digest, and optionally content_format (text for search-snippet captures, markdown for full fetches).
Today capture_history returns the latest capture entry only. It is metadata for change detection — compare content_digest across reads — not a surface for reading old captures.
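Change detection then reduces to comparing one digest between reads. A minimal sketch follows, assuming capture_history is returned as a top-level list whose first entry is the latest capture. The function names are hypothetical.

```python
from typing import Optional


def latest_digest(response: dict) -> Optional[str]:
    """Return content_digest of the latest capture entry, if any."""
    history = response.get("capture_history") or []
    return history[0]["content_digest"] if history else None


def has_changed(previous_digest: Optional[str], response: dict) -> bool:
    """True when the response carries a digest different from the stored one.

    A first sighting (no stored digest) also counts as a change.
    """
    current = latest_digest(response)
    return current is not None and current != previous_digest
```

Because the official clients default to include: ["metadata", "content"], a client-based check needs capture_history requested explicitly.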
Content freshness is managed server-side: a document is re-fetched when no capture exists, the capture has no text, the capture TTL expired (search-derived captures roughly 15 minutes, full-document captures roughly 24 hours), or full_document was requested against a snippet-only capture. The provenance block (capture_id, capture_time) always identifies the capture the returned content came from.
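The four re-fetch conditions can be read as a single decision function. The sketch below is only an illustration of the rules as stated, not the server's implementation. The `Capture` type and the exact TTL constants are assumptions based on the "roughly 15 minutes" and "roughly 24 hours" figures.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Approximate TTLs taken from the paragraph above.
SNIPPET_TTL = timedelta(minutes=15)
FULL_TTL = timedelta(hours=24)


@dataclass
class Capture:
    """Hypothetical model of a stored capture."""
    capture_time: datetime
    text: str
    is_full_document: bool


def needs_refetch(capture: Optional[Capture], selection: str, now: datetime) -> bool:
    """Apply the documented re-fetch rules in order."""
    if capture is None:
        return True  # no capture exists
    if not capture.text:
        return True  # the capture has no text
    ttl = FULL_TTL if capture.is_full_document else SNIPPET_TTL
    if now - capture.capture_time > ttl:
        return True  # the capture TTL expired
    if selection == "full_document" and not capture.is_full_document:
        return True  # full text requested against a snippet-only capture
    return False
```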
Passages
passages are structured spans for citation: passage_id, doc_id, ordinal, text, plus optional section_heading, section_path, and char_start/char_end (offsets only with content.include_offsets: true on a full-document capture).
To re-fetch specific passages from a search result, use content.selection: "passage_ids" with content.passage_ids. Passage IDs are latest-capture handles — they change when the document content changes. Stale or unknown IDs never fail the request: available passages are returned plus warning stale_passage_id listing the missing IDs in details.passage_ids. Treat that warning as “the document moved on” and re-search or re-read rather than retrying the same IDs.
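A passage re-fetch and its stale-ID handling might look like the sketch below. The request fields come from the text above, while the response's warning envelope (a top-level `warnings` list with `code` and `details`) is an assumption. Both helper names are hypothetical.

```python
from typing import Iterable, List


def passage_request(doc_id: str, passage_ids: Iterable[str]) -> dict:
    """Build a /v1/document payload that re-fetches specific passages."""
    return {
        "doc_id": doc_id,
        "content": {
            "selection": "passage_ids",
            "passage_ids": list(passage_ids),
        },
    }


def missing_passage_ids(response: dict) -> List[str]:
    """Collect IDs reported by stale_passage_id warnings.

    Assumed warning shape: {"code": ..., "details": {"passage_ids": [...]}}.
    """
    missing: List[str] = []
    for warning in response.get("warnings", []):
        if warning.get("code") == "stale_passage_id":
            missing.extend(warning.get("details", {}).get("passage_ids", []))
    return missing
```

A non-empty result from `missing_passage_ids` is the signal to re-search or re-read the document instead of retrying the same IDs.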
Full request and response schemas are in the API reference; error envelopes and codes in errors.