
The handles
All identifiers are plain UUIDs. The content handles (doc_id, capture_id, and passage_id) are derived deterministically, so the same input produces the same handle. search_id and request_id are minted fresh for every response.
| Handle | Identifies | Derivation and stability |
|---|---|---|
| doc_id | The canonical document | Derived from canonical_url. The same document keeps the same doc_id across searches and recrawls. |
| capture_id | One specific capture of the document | Derived from the capture source, a UTC day bucket, and the retrieval request. A re-fetch on a new day is a new capture with a new capture_id. |
| passage_id | One passage in the latest capture | Derived from doc_id, capture_id, ordinal, and a hash of the passage text. It changes when the content changes or a newer capture replaces the old one, so it is a latest-capture handle, not an eternal one. |
| search_id | One /v1/search response | The handle /v1/feedback uses to attribute an event to a ranked result set. |
| request_id | One HTTP request | For debugging and support; appears on every response, including errors. |
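
The difference between derived and freshly minted handles can be illustrated with a minimal sketch. The service's real derivation scheme is not described here; the namespace, the separator, and the function names below are hypothetical and only demonstrate the property that identical inputs give identical handles.

```python
import uuid

# Hypothetical namespace, chosen only for this illustration.
EXAMPLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/handles")


def derive_handle(*parts: str) -> str:
    """Return a name-based UUID that depends only on the given parts."""
    # A unit-separator character keeps ("ab", "c") distinct from ("a", "bc").
    return str(uuid.uuid5(EXAMPLE_NAMESPACE, "\x1f".join(parts)))


def fresh_handle() -> str:
    """Return a random UUID, as a stand-in for search_id or request_id."""
    return str(uuid.uuid4())
```

Calling derive_handle twice with the same canonical URL returns the same string, while two calls to fresh_handle differ. Clients should read handles from responses rather than compute them.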
Stable vs. response-local
Identity is stable; ranking is not. Treat them differently:
- Safe to store and cite: doc_id, canonical_url, capture_id, capture_time, passage_id (valid until the content changes), content_digest, and search_id (for feedback).
- Never store as facts: rank and score.value. The ranking.score_scope field is always "response_local": scores compare results within one response only, never across responses, modes, or ranker versions. The score field is present only when the second-stage reranker scored the result; in fast mode it is absent.
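
One way to apply this split when persisting results is to copy only the stable fields. This is a sketch under assumptions: it takes a flattened result dict (the real response nests some of these fields), and storable_record is a hypothetical helper name.

```python
# Fields the text marks as safe to store and cite.
STABLE_FIELDS = (
    "doc_id",
    "canonical_url",
    "capture_id",
    "capture_time",
    "passage_id",
    "content_digest",
)


def storable_record(result: dict, search_id: str) -> dict:
    """Copy the stable fields of one result; rank and score are left behind."""
    record = {key: result[key] for key in STABLE_FIELDS if key in result}
    # search_id is kept so later feedback can reference the result set.
    record["search_id"] = search_id
    return record
```

Because rank and score never reach storage, nothing downstream can mistakenly compare them across responses.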
canonical_url vs. source_url
Both exist because deduplication and citation pull in different directions.
canonical_url is the normalized representative URL and the input to doc_id: scheme defaults to https, scheme and host are lowercased, the fragment is stripped, tracking parameters (utm_*, fbclid, gclid, msclkid) are removed, and the trailing slash is trimmed except at the root. Two URLs that differ only in tracking noise are the same document with the same doc_id.
source_url is the URL as actually provided or captured — use it when you need the link that was really fetched.
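
The normalization rules listed for canonical_url can be approximated as follows. This is a sketch, not the service's implementation: edge cases it does not mention (query re-encoding, default ports, parameter ordering) are handled here in whatever way the standard library does, and canonicalize is a hypothetical name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_KEYS = {"fbclid", "gclid", "msclkid"}


def canonicalize(url: str) -> str:
    """Apply the documented normalization steps to a URL."""
    if "://" not in url:
        url = "https://" + url  # scheme defaults to https
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # Lowercase the host (and port) only; leave any userinfo untouched.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    # Drop utm_* and the named click-tracking parameters.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not (key.lower().startswith("utm_") or key.lower() in TRACKING_KEYS)
    ]
    # Trim the trailing slash except at the root.
    path = parts.path
    if path not in ("", "/"):
        path = path.rstrip("/") or "/"
    # An empty fragment argument strips the fragment.
    return urlunsplit((scheme, netloc, path, urlencode(query), ""))
```

For example, canonicalize("HTTPS://Example.COM/Path/?utm_source=x&id=7#top") returns "https://example.com/Path?id=7". Use the canonical_url returned by the API as the authoritative value; a local approximation like this is useful mainly for predicting which inputs will collapse to one document.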
Timestamps and content_digest
The metadata block on search results carries the document's observation history. All timestamp values are RFC 3339 strings.
| Field | Meaning |
|---|---|
| first_seen_at | First time the system saw this canonical document |
| last_seen_at | Most recent sighting |
| last_crawled_at | Capture time of the underlying capture |
| extracted_at | When content was extracted from that capture |
| published_at | Best-effort publication date parsed from source metadata; may be absent |
| content_digest | sha256: followed by the hex digest of the captured content |
content_digest is the drift detector: store it alongside anything you cite, and compare it on the next read. A changed digest means the content changed and any claims built on the old capture need re-verification.
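
A drift check reduces to a string comparison between the stored digest and the one on the next read. The sketch below also shows the sha256: prefix format; exactly which bytes the service hashes is not specified here, so recomputing a digest locally is an assumption and comparing the service-provided strings is the safer path. Both function names are hypothetical.

```python
import hashlib


def format_digest(content: bytes) -> str:
    """Format a SHA-256 digest the way the content_digest field is described."""
    return "sha256:" + hashlib.sha256(content).hexdigest()


def has_drifted(stored_digest: str, current_digest: str) -> bool:
    """Return True when cited claims need re-verification."""
    return stored_digest != current_digest
```

Store the digest next to each citation, then call has_drifted with the stored value and the value from the latest response before repeating a claim.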
The provenance object
provenance is exactly two fields:
- On /v1/search results, provenance is returned only at response.verbosity: "full" (see response shaping). The default standard verbosity omits it.
- On /v1/document, provenance is always present when a capture exists; it names the capture the returned content actually came from.
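
Since provenance may be absent, code that builds a citation should tolerate a missing block. The sketch below assumes the two provenance fields are capture_id and capture_time (inferred from how this page cites them elsewhere) and that doc_id sits at the top level of the document response; build_citation is a hypothetical helper.

```python
from typing import Optional


def build_citation(document: dict) -> Optional[dict]:
    """Pin a citation to the capture that was read, or return None."""
    provenance = document.get("provenance")
    if not provenance:
        # No capture exists, or the response verbosity omitted the block.
        return None
    return {
        "doc_id": document["doc_id"],
        "capture_id": provenance["capture_id"],
        "capture_time": provenance["capture_time"],
    }
```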
Threading the loop
The handles connect the three verbs: a search result's doc_id feeds /v1/document, and the read feeds /v1/feedback with search_id, doc_id, and passage_id together. The loop runs without an API key on the anonymous tier.
For API-key callers, the full ranked result set is recorded under your account before response shaping sheds tail results — feedback on a result that was shed from the response is still valid.
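
The threading of handles through the loop can be sketched without any network calls. The response shapes are assumptions: a search response with a top-level search_id and a results list, and a document response with a passages list. The helper name and its parameters are hypothetical, and the real /v1/feedback body may need further fields not shown here.

```python
def thread_handles(
    search_response: dict,
    document_response: dict,
    result_index: int = 0,
    passage_index: int = 0,
) -> dict:
    """Collect the three handles a feedback event refers to."""
    result = search_response["results"][result_index]
    passage = document_response["passages"][passage_index]
    return {
        "search_id": search_response["search_id"],  # which ranked result set
        "doc_id": result["doc_id"],                 # which document was read
        "passage_id": passage["passage_id"],        # which passage was used
    }
```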
Why this matters for agents
- Cite what you actually read. A bare URL cites whatever the page serves at click time. Citing doc_id plus capture_id and capture_time from the /v1/document provenance block pins the claim to the content the agent really consumed.
- Keep continuation reads honest. When reading a long document in ranges, pin the capture with content.range.capture_id. If a newer capture has replaced it, the response carries a stale_range warning instead of silently serving misaligned offsets. See documents for the continuation loop.
- Detect content drift. Compare content_digest between reads. Same digest, same content: earlier conclusions still hold. Different digest: re-read before repeating a claim.
- Tolerate stale passage handles. Requesting a passage_id that no longer exists in the latest capture does not fail: /v1/document returns the passages that are still available plus a stale_passage_id warning naming the missing ones.
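
An agent can turn the two warnings named above into explicit follow-up actions. The warning names come from the text; the response shape is an assumption (a top-level warnings list of objects with a code field), and plan_followups is a hypothetical helper.

```python
def plan_followups(response: dict) -> dict:
    """Map stale-handle warnings on a document response to follow-up actions."""
    codes = {warning.get("code") for warning in response.get("warnings", [])}
    return {
        # A newer capture replaced the pinned one: restart the range read.
        "restart_range_read": "stale_range" in codes,
        # Some requested passages are gone: refresh the stored passage handles.
        "refresh_passage_ids": "stale_passage_id" in codes,
    }
```

A response with no warnings yields False for both actions, so the same call works on every read.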
Next
- Search — modes, ranking, and where the handles are minted
- Documents — the read loop, capture freshness, and range reads
- Quickstart — the full search → read → feedback loop in two minutes