Provenance

Search results can drift: URLs change, snippets age, and the source an agent saw may not match what a human opens later. Caesar returns source identity and capture metadata so a citation can point at content that can be verified later.

A glass orb holding layered documents, punched-card textures, botanical shadows, and orbiting provenance paths.

cURL

curl -s https://alpha.api.trycaesar.com/v1/search \
  -H "Authorization: Bearer $CAESAR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "linux kernel amd gpu suspend",
    "max_results": 1,
    "response": { "verbosity": "full" }
  }'

The result, trimmed and annotated:

{
  "search_id": "a7e2b9c1-3d4f-4a5b-8c6d-9e0f1a2b3c4d",  // ID for /v1/feedback
  "ranking": {
    "ranker_version": "reranked_v1",
    "score_scope": "response_local"                      // scores never compare across responses
  },
  "results": [
    {
      "rank": 1,                                          // response-local — never store as a fact
      "doc_id": "0c944fa8-4c8f-4f48-9b08-0fb2fd3438ec",   // document identity
      "canonical_url": "https://example.com/amdgpu-suspend-regression",
      "source_url": "https://example.com/amdgpu-suspend-regression?utm_source=feed",
      "title": "amdgpu: suspend regression report",
      "score": { "value": 0.87 },                         // response-local — never store as a fact
      "metadata": {
        "first_seen_at": "2026-06-12T09:58:00Z",
        "last_seen_at": "2026-06-12T09:58:00Z",
        "last_crawled_at": "2026-06-12T09:58:00Z",
        "extracted_at": "2026-06-12T09:58:00Z",
        "content_digest": "sha256:3f6e…",                 // verify content identity across reads
        "published_at": "2026-06-01T00:00:00Z"
      },
      "provenance": {                                     // only at response.verbosity "full"
        "capture_id": "c3d4e5f6-a7b8-4c9d-8e0f-1a2b3c4d5e6f",
        "capture_time": "2026-06-12T09:58:00Z"
      },
      "passages": [
        {
          "passage_id": "d4e5f6a7-b8c9-4d0e-9f1a-2b3c4d5e6f70",
          "doc_id": "0c944fa8-4c8f-4f48-9b08-0fb2fd3438ec",
          "ordinal": 1,
          "text": "Suspend works on 6.9 but resume fails…"
        }
      ]
    }
  ]
}

The identifiers

All identifiers are plain UUIDs. The content identifiers — doc_id, capture_id, and passage_id — are derived deterministically, so the same input produces the same ID. search_id and request_id are minted fresh for every response.

Identifier	Identifies	Derivation and stability
`doc_id`	The canonical document	Derived from `canonical_url`. The same document keeps the same `doc_id` across searches and recrawls.
`capture_id`	One specific capture of the document	Derived from the capture source, a UTC day bucket, and the retrieval request. A re-fetch on a new day is a new capture with a new `capture_id`.
`passage_id`	One passage in the latest capture	Derived from `doc_id`, `capture_id`, ordinal, and a hash of the passage text. It changes when the content changes or a newer capture replaces the old one — a latest-capture identifier, not an eternal one.
`search_id`	One `/v1/search` response	The ID `/v1/feedback` uses to attribute an event to a ranked result set.
`request_id`	One HTTP request	For debugging and support; appears on every response, including errors.

Identity vs. response-local

Identity persists across reads; ranking is response-local. Treat them differently:

Safe to store and cite: doc_id, canonical_url, capture_id, capture_time, passage_id (valid until the content changes), content_digest, and search_id (for feedback).
Never store as facts: rank and score.value. The ranking.score_scope field is always "response_local" — scores compare results within one response only, never across responses or ranker versions. score is present only when the second-stage reranker scored the result; when reranking is unavailable and results fall back to first-stage order, it is absent.

Replayed searches may be served from cache. Deterministic ordering, identical scores, and identical search_id values across repeated identical queries are not guaranteed. If you need to refer back to a specific result set, keep the search_id from the response you actually received.

canonical_url vs. source_url

Both exist because deduplication and citation pull in different directions. canonical_url is the normalized representative URL and the input to doc_id: scheme defaults to https, scheme and host are lowercased, the fragment is stripped, tracking parameters (utm_*, fbclid, gclid, msclkid) are removed, and the trailing slash is trimmed except at the root. Two URLs that differ only in tracking noise are the same document with the same doc_id. source_url is the URL as actually provided or captured — use it when you need the link that was really fetched.

Timestamps and content_digest

The metadata block on search results carries the document’s observation history. All values are RFC3339 strings.

Field	Meaning
`first_seen_at`	First time the system saw this canonical document
`last_seen_at`	Most recent sighting
`last_crawled_at`	Capture time of the underlying capture
`extracted_at`	When content was extracted from that capture
`published_at`	Best-effort publication date parsed from source metadata; may be absent
`content_digest`	`sha256:` followed by the hex digest of the captured content

content_digest is the drift detector: store it alongside anything you cite, and compare it on the next read. A changed digest means the content changed and any claims built on the old capture need re-verification.

The provenance object

provenance is exactly two fields:

{ "capture_id": "c3d4e5f6-a7b8-4c9d-8e0f-1a2b3c4d5e6f", "capture_time": "2026-06-12T09:58:00Z" }

Where it appears differs by endpoint:

On /v1/search results, provenance is returned only at response.verbosity: "full" (see response shaping). The default standard verbosity omits it.
On /v1/document, provenance is always present when a capture exists — it names the capture the returned content actually came from.

Search, read, cite

The IDs connect search results to full-document reads. This runs with CAESAR_API_KEY:

cURL

BASE=https://alpha.api.trycaesar.com

# 1. Search — keep search_id and the doc_id you act on
SEARCH=$(curl -s $BASE/v1/search \
  -H "Authorization: Bearer $CAESAR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query": "linux kernel amd gpu suspend", "max_results": 3}')
SEARCH_ID=$(printf '%s' "$SEARCH" | jq -r .search_id)
DOC_ID=$(printf '%s' "$SEARCH" | jq -r '.results[0].doc_id')

# 2. Read — provenance pins the capture the content came from
DOCUMENT=$(curl -s $BASE/v1/document \
  -H "Authorization: Bearer $CAESAR_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{\"doc_id\": \"$DOC_ID\", \"query\": \"suspend regression\"}")
CAPTURE_ID=$(printf '%s' "$DOCUMENT" | jq -r .provenance.capture_id)
PASSAGE_ID=$(printf '%s' "$DOCUMENT" | jq -r '.passages[0].passage_id')

If you send feedback later, use the search_id, doc_id, and passage_id from the response you actually used.

Why this matters for agents

Cite what you actually read. A bare URL cites whatever the page serves at click time. Citing doc_id plus capture_id and capture_time from the /v1/document provenance block pins the claim to the content the agent really consumed.
Keep continuation reads honest. When reading a long document in ranges, pin the capture with content.range.capture_id. If a newer capture has replaced it, the response carries a stale_range warning instead of silently serving misaligned offsets. See documents for the continuation loop.
Detect content drift. Compare content_digest between reads. Same digest, same content — earlier conclusions still hold. Different digest, re-read before repeating a claim.
Tolerate stale passage IDs. Requesting a passage_id that no longer exists in the latest capture does not fail: /v1/document returns the passages that are still available plus a stale_passage_id warning naming the missing ones.

Search — ranking and where source IDs are returned
Documents — the read loop, capture freshness, and range reads
Quickstart — search and read in two minutes

Core concepts

The identifiers

Identity vs. response-local

canonical_url vs. source_url

Timestamps and content_digest

The provenance object

Search, read, cite

Why this matters for agents

Next

​The identifiers

​Identity vs. response-local

​canonical_url vs. source_url

​Timestamps and content_digest

​The provenance object

​Search, read, cite

​Why this matters for agents

​Next

The identifiers

Identity vs. response-local

canonical_url vs. source_url

Timestamps and content_digest

The provenance object

Search, read, cite

Why this matters for agents

Next