Safari (Reading List + history fallback)

The Safari connector is an edge harvester that pushes Reading List adds, Bookmarks snapshots, and (optionally) browsing history from a macOS device into Carabase. It’s the “fallback” path: it reads directly from ~/Library/Safari/, so it works without a browser extension installed.

A future browser-extension path (Phase 4 streaming push) will land per-event streams in the same safari_reading_list substrate table. Until then, the Reading List/history fallback covers users who haven’t installed the extension yet, as well as browser sessions where the extension isn’t running.

Highlights — Lane A part 1 (live in v0.1)

The first Safari PR shipped a focused subset of the spec. Reading List adds + Bookmarks-snapshot change-detection are live; per-visit history substrate writes + per-domain history rollups + bookmarks-as-folio-readme land in Lane A part 2.

Reading List items land as substrate rows. Every Reading List add the desktop pushes lands in safari_reading_list keyed on (workspace_id, url_hash) with scrape_status = 'pending' and artifact_id = NULL. The downstream safari-scrape worker that hydrates the body into a Tier-0 artifact + stamps artifact_id lands in part 2 — until then, Reading List rows persist for replay-safety but no searchable artifact is produced.
History is counted, not (yet) persisted per-visit. The desktop pushes the history batch; the route advances safari_sync_state.last_history_synced_at to the newest visit timestamp it sees and returns. Per-visit rows + Tier 2 daily rollup writes are deferred to part 2.
Bookmarks snapshot is hashed. The desktop sends a bookmarks { source, items[], sha256 } blob; the route stores the sha256 + source on safari_sync_state so subsequent syncs short-circuit when the bookmarks haven’t changed. Folio-readme writes for the bookmarks list land in part 2.
Workspace-scoped sync state. safari_sync_state is keyed by workspace_id (PK), not per-device. Multi-device pushes use atomic GREATEST(existing, candidate) upserts so an older sync from device B can’t rewind a newer cursor from device A.
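
The cursor and bookmarks-hash behaviour described above can be sketched in TypeScript. This is an in-memory illustration only, not the route's implementation: the real upsert is atomic in the database, whereas this sketch just shows the merge rule. The interface names, camelCase fields, nullability, and the choice to treat a null cursor as "no cursor yet" are all assumptions.

```typescript
// Hypothetical in-memory view of one safari_sync_state row. Field names are
// camelCase stand-ins for the documented columns; nullability is assumed.
interface SafariSyncState {
  workspaceId: string;
  lastHistorySyncedAt: Date | null;
  bookmarksSha256: string | null;
  bookmarksSource: string | null;
}

// Hypothetical summary of what a single push contributes to the sync state.
interface SyncPush {
  newestVisitAt: Date | null;
  bookmarks?: { source: string; sha256: string };
}

// Mirrors GREATEST(existing, candidate): the cursor only ever moves forward.
// Treating null as "no cursor yet" is an assumption of this sketch.
function greatest(existing: Date | null, candidate: Date | null): Date | null {
  if (existing === null) return candidate;
  if (candidate === null) return existing;
  return candidate.getTime() > existing.getTime() ? candidate : existing;
}

// True when the stored hash already matches, i.e. the sync can short-circuit.
function bookmarksUnchanged(state: SafariSyncState, sha256: string): boolean {
  return state.bookmarksSha256 === sha256;
}

function applyPush(state: SafariSyncState, push: SyncPush): SafariSyncState {
  const next: SafariSyncState = {
    workspaceId: state.workspaceId,
    lastHistorySyncedAt: greatest(state.lastHistorySyncedAt, push.newestVisitAt),
    bookmarksSha256: state.bookmarksSha256,
    bookmarksSource: state.bookmarksSource,
  };
  if (push.bookmarks && !bookmarksUnchanged(state, push.bookmarks.sha256)) {
    next.bookmarksSha256 = push.bookmarks.sha256;
    next.bookmarksSource = push.bookmarks.source;
  }
  return next;
}

// Device A pushes a newer batch first; device B's older batch arrives later.
const initialState: SafariSyncState = {
  workspaceId: "ws_example",
  lastHistorySyncedAt: null,
  bookmarksSha256: null,
  bookmarksSource: null,
};
const afterDeviceA = applyPush(initialState, {
  newestVisitAt: new Date("2024-05-02T10:00:00Z"),
  bookmarks: { source: "example-source", sha256: "abc123" },
});
const afterDeviceB = applyPush(afterDeviceA, {
  newestVisitAt: new Date("2024-05-01T08:00:00Z"),
  bookmarks: { source: "example-source", sha256: "abc123" },
});
```

In this sketch, afterDeviceB keeps device A's cursor, and the second bookmarks blob is recognised as unchanged because its sha256 matches the stored one.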

Setup

The Safari edge harvester ships in the Desktop client (separate repo). Pair via the standard edge-harvester pairing flow using connector: "safari". The push endpoint is POST /api/v1/safari/sync.
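
For orientation, a push to this endpoint might be assembled roughly as follows. Only the path, the HTTP method, and the bookmarks field names (source, items, sha256) come from this page. The rest of the payload shape, the item fields, the JSON content type, and the helper names are hypothetical, and the credentials issued by pairing are deliberately left out because this page does not describe them.

```typescript
// Hypothetical payload shape. Only the bookmarks blob's field names
// (source, items, sha256) are documented; everything else is assumed.
interface SafariSyncPayload {
  readingList: Array<{ url: string; title: string; addedAt: string }>;
  history: Array<{ url: string; visitedAt: string }>;
  bookmarks?: {
    source: string;
    items: Array<{ url: string; title: string }>;
    sha256: string;
  };
}

// Plain description of the request, so it can be handed to any HTTP client.
interface PreparedRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

const SAFARI_SYNC_PATH = "/api/v1/safari/sync";

function buildSyncRequest(baseUrl: string, payload: SafariSyncPayload): PreparedRequest {
  return {
    // Trim trailing slashes so the path is not doubled up.
    url: baseUrl.replace(/\/+$/, "") + SAFARI_SYNC_PATH,
    method: "POST",
    // JSON content type is an assumption; pairing credentials are omitted.
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  };
}

// Example with a placeholder host and placeholder data.
const prepared = buildSyncRequest("https://carabase.example/", {
  readingList: [
    { url: "https://example.com/post", title: "Example post", addedAt: "2024-05-02T10:00:00Z" },
  ],
  history: [],
});
```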

Filter shape

SafariFilters lives in src/types/sync-rules.ts. The fields below are declared on the type today; only those marked (part 1) are enforced by the part-1 sync route. The rest activate when Lane A part 2 lands.

excludeDomains / includeDomains — host substring match, case-insensitive. (part 2)
dailyTopN — top-N per-domain visit count surfaced in the Tier 2 daily rollup body (default 10). (part 2)
minDailyVisits — drop daily rollups with fewer than this many total visits (default 3, suppresses noise). (part 2)
minVisitCount — per-URL minimum visit count before it appears at all (default 1; set to 2+ to filter out single-tap accidents). (part 2)
includeReadingList / includeBookmarks — gate Reading List + Bookmarks pipelines. (part 1 — Reading List honoured today; bookmarks routing in part 2)
scrapeTopVisits — opt-in body hydration for the top-N visited URLs in the daily rollup. (part 2)
promoteByEntityDomain — Tier-0 promote per-visit when the URL’s host matches an existing entity’s url_domain. (part 2)
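
As an illustration of how the domain and rollup fields might combine, here is a sketch. The field names and the defaults (10, 3, 1) are taken from the list above. The field types, the interface name SafariFiltersSketch, the helper functions, and the rule that an exclude match wins over an include match are assumptions; the authoritative type is the one in src/types/sync-rules.ts.

```typescript
// Sketch of the filter shape. Names match the documented fields; the types
// (especially the booleans for scrapeTopVisits and promoteByEntityDomain)
// are assumptions.
interface SafariFiltersSketch {
  excludeDomains?: string[];
  includeDomains?: string[];
  dailyTopN?: number;
  minDailyVisits?: number;
  minVisitCount?: number;
  includeReadingList?: boolean;
  includeBookmarks?: boolean;
  scrapeTopVisits?: boolean;
  promoteByEntityDomain?: boolean;
}

// Defaults as documented in the field list.
const DOCUMENTED_DEFAULTS = { dailyTopN: 10, minDailyVisits: 3, minVisitCount: 1 };

// Host substring match, case-insensitive.
function hostMatchesAny(host: string, patterns: string[]): boolean {
  const lowered = host.toLowerCase();
  return patterns.some((p) => p.length > 0 && lowered.indexOf(p.toLowerCase()) !== -1);
}

// Assumed precedence: an exclude match always wins; an empty or missing
// include list means "allow everything not excluded".
function hostAllowed(host: string, filters: SafariFiltersSketch): boolean {
  if (hostMatchesAny(host, filters.excludeDomains ?? [])) return false;
  const include = filters.includeDomains ?? [];
  return include.length === 0 || hostMatchesAny(host, include);
}

// Returns the top-N [host, visits] pairs for one day, or null when the day
// falls below minDailyVisits and the rollup would be dropped.
function dailyRollupDomains(
  visitsByHost: Record<string, number>,
  filters: SafariFiltersSketch
): Array<[string, number]> | null {
  const entries: Array<[string, number]> = Object.keys(visitsByHost)
    .filter((host) => hostAllowed(host, filters))
    .map((host): [string, number] => [host, visitsByHost[host]]);
  const total = entries.reduce((sum, entry) => sum + entry[1], 0);
  if (total < (filters.minDailyVisits ?? DOCUMENTED_DEFAULTS.minDailyVisits)) return null;
  return entries
    .sort((a, b) => b[1] - a[1])
    .slice(0, filters.dailyTopN ?? DOCUMENTED_DEFAULTS.dailyTopN);
}
```

Because matching is by substring, a pattern such as "example.com" also matches "news.example.com"; narrower patterns are needed to target a single subdomain.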

Substrate columns

safari_reading_list — per-URL row keyed (workspace_id, url_hash) UNIQUE with url, title, added_at, read_at, scrape_status (closed enum pending | ready | thin | failed), and artifact_id (stamped by the scrape worker on success). See the database schema reference.
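
The row shape and its closed status enum can be modelled as below. Column names follow the description above; the TypeScript types, nullability, the string timestamps, and the helper names are assumptions. How url_hash is computed is not covered here, so the sketch takes it as an input, and the insert-if-absent replay rule is likewise an assumption.

```typescript
// Closed enum for scrape_status, as listed in the column description.
const SCRAPE_STATUSES = ["pending", "ready", "thin", "failed"] as const;
type ScrapeStatus = (typeof SCRAPE_STATUSES)[number];

// Column names as documented; types and nullability are assumptions.
interface SafariReadingListRow {
  workspace_id: string;
  url_hash: string;
  url: string;
  title: string | null;
  added_at: string;
  read_at: string | null;
  scrape_status: ScrapeStatus;
  artifact_id: string | null;
}

// Runtime guard for values arriving from outside the type system.
function isScrapeStatus(value: unknown): value is ScrapeStatus {
  return typeof value === "string" && (SCRAPE_STATUSES as readonly string[]).indexOf(value) !== -1;
}

// In-memory stand-in for the table, keyed like the UNIQUE constraint.
type ReadingListTable = Record<string, SafariReadingListRow>;

// Records a Reading List add. New rows start as pending with no artifact.
// Leaving an existing row untouched on replay is an assumed rule.
function recordAdd(
  table: ReadingListTable,
  add: { workspace_id: string; url_hash: string; url: string; title: string | null; added_at: string }
): void {
  const key = `${add.workspace_id}:${add.url_hash}`;
  if (table[key] !== undefined) return;
  table[key] = {
    workspace_id: add.workspace_id,
    url_hash: add.url_hash,
    url: add.url,
    title: add.title,
    added_at: add.added_at,
    read_at: null,
    scrape_status: "pending",
    artifact_id: null,
  };
}

// Pushing the same add twice leaves a single row behind.
const readingListTable: ReadingListTable = {};
const exampleAdd = {
  workspace_id: "ws_example",
  url_hash: "hash_of_url",
  url: "https://example.com/post",
  title: "Example post",
  added_at: "2024-05-02T10:00:00Z",
};
recordAdd(readingListTable, exampleAdd);
recordAdd(readingListTable, exampleAdd);
```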