Knowledge graph

Carabase maintains a knowledge graph of every entity you’ve mentioned — people, projects, organizations, concepts, tools, topics, events. The graph is incrementally extracted from your daily notes by the harvest pipeline, then exposed to the agent through the MCP server so it can do things like “what did Alice ship for the fundraise project last quarter?” without you having to manually link anything.

A row in the entities table:

{
  id: uuid,
  workspace_id: uuid,
  name: "Alice Chen",
  type: "person",        // person, project, concept, organization, tool, topic, event
  is_canonical: true,    // false for aliases that point at the canonical row
  canonical_id: null,    // the parent canonical entity if this is an alias
  metadata: {
    concept_role: "self" | "primary_org" | null, // marks special roots
    fixture_id: "...",   // debug-only, present on seeded test data
    ...
  }
}

The concept_role field marks two special entities per workspace: the self (the user the workspace is about) and the primary org (their employer). Both are surfaced in agent context to anchor pronoun resolution and “we” references.

People go by multiple names. The graph supports this with an alias mechanism: a non-canonical entity row points at a canonical one via canonical_id. The corpus curator (a nightly worker) proposes alias merges; you accept or reject them through the curation_suggestions UI.
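The lookup side of that mechanism is plain pointer-chasing. A minimal sketch, assuming an in-memory mirror of the entities table (the shape and function name are illustrative, not Carabase's actual code):

```typescript
// Minimal alias-resolution sketch mirroring the entities table (illustrative).
interface EntityRow {
  id: string;
  name: string;
  is_canonical: boolean;
  canonical_id: string | null;
}

// Follow canonical_id pointers until a canonical row is reached.
function resolveCanonical(id: string, byId: Map<string, EntityRow>): EntityRow {
  let row = byId.get(id);
  if (!row) throw new Error(`unknown entity ${id}`);
  const seen = new Set<string>();
  while (!row.is_canonical && row.canonical_id !== null) {
    if (seen.has(row.id)) throw new Error("alias cycle"); // merges shouldn't create these
    seen.add(row.id);
    const parent = byId.get(row.canonical_id);
    if (!parent) break; // dangling pointer: treat the alias row as terminal
    row = parent;
  }
  return row;
}
```

The cycle guard is defensive: accepted merges should form a tree, but a resolver that loops forever on bad data is worse than one that throws.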

A row in the edges table:

{
  id, workspace_id,
  source_id, target_id,  // entity ids
  type: "works_at",      // free-form predicate
  source_kind: "extracted" | "inferred" | "ambiguous",
  confidence: 0.85,      // [0, 1]
  provenance: { ... }    // jsonb bag with the producing generator's evidence
}

Together, the three provenance fields (source_kind, confidence, provenance) let downstream consumers ask trust-aware questions:

  • carabase_search_graph exposes min_confidence + source_kinds filters that push down to SQL
  • The default formatter annotates non-extracted edges with [inferred 0.62]-style suffixes when surfacing them to the agent
  • The curator can produce 'inferred' edges with low confidence; the agent sees them but knows to weigh them less than first-hand observations
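The formatter's annotation rule can be sketched in a few lines. The function name and exact layout here are assumptions; only the `[inferred 0.62]` suffix format comes from this doc:

```typescript
// Suffix-annotation sketch: extracted edges pass through bare,
// inferred/ambiguous edges get a "[kind confidence]" suffix.
type SourceKind = "extracted" | "inferred" | "ambiguous";

function annotateEdgeLabel(label: string, sourceKind: SourceKind, confidence: number): string {
  if (sourceKind === "extracted") return label;
  return `${label} [${sourceKind} ${confidence.toFixed(2)}]`;
}
```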

Adding a new edge producer? Set both source_kind and confidence explicitly. A forgotten value defaults to 'extracted' / 1.0, which silently promotes junk into high-trust territory.
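One way to enforce that discipline is a constructor where both fields are required parameters, so the type checker rejects a producer that omits them. This is a hypothetical helper, not part of the actual codebase:

```typescript
// Hypothetical defensive constructor: both provenance fields are required,
// so a new producer can't silently inherit 'extracted' / 1.0.
type SourceKind = "extracted" | "inferred" | "ambiguous";

interface NewEdge {
  source_id: string;
  target_id: string;
  type: string;
  source_kind: SourceKind;
  confidence: number;
}

function makeEdge(
  source_id: string,
  target_id: string,
  type: string,
  source_kind: SourceKind, // no default: callers must decide
  confidence: number,      // no default: callers must decide
): NewEdge {
  if (!(confidence >= 0 && confidence <= 1)) {
    throw new RangeError(`confidence must be in [0, 1], got ${confidence}`);
  }
  return { source_id, target_id, type, source_kind, confidence };
}
```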

The graph is built in five stages:

  1. Daily notes accumulate logCards (from your typing, from connectors, from agentic flows)
  2. The harvest pipeline reads each logCard, calls an LLM (the utility-high model role) to extract { entities[], relationships[] }
  3. Entities are upserted (with alias resolution); edges are inserted with source_kind: "extracted" and confidence: 1.0
  4. The memory-graph bridge translates Mem0 facts into edges with appropriate provenance
  5. Nightly, the corpus curator walks the graph and suggests alias merges, role enrichment, edge inferences, and stale-entity cleanup — all as curation_suggestions for you to accept or reject
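Step 3 can be sketched as a pure shaping function. The `{ entities[], relationships[] }` envelope and the `"extracted"` / 1.0 provenance come from the steps above; the field names inside relationships are assumptions:

```typescript
// Sketch of step 3: shaping an extraction result into edge rows.
interface Extraction {
  entities: { name: string; type: string }[];
  relationships: { source: string; predicate: string; target: string }[];
}

interface ExtractedEdgeRow {
  source_id: string;
  target_id: string;
  type: string;
  source_kind: "extracted";
  confidence: number;
}

// idByName would come from the entity upsert + alias resolution in the same step.
function toEdgeRows(x: Extraction, idByName: Map<string, string>): ExtractedEdgeRow[] {
  const rows: ExtractedEdgeRow[] = [];
  for (const r of x.relationships) {
    const source_id = idByName.get(r.source);
    const target_id = idByName.get(r.target);
    if (!source_id || !target_id) continue; // skip relationships to unknown entities
    rows.push({ source_id, target_id, type: r.predicate, source_kind: "extracted", confidence: 1.0 });
  }
  return rows;
}
```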

You query the graph through the MCP tools that ship in @carabase/mcp-server:

  • carabase_search_semantic(query, k?) — pgvector semantic search across artifacts
  • carabase_search_graph(start_entity, depth?, min_confidence?, source_kinds?) — graph traversal from a named entity
  • carabase_query_metadata(filters) — structured queries by entity name / folio / date
  • carabase_find_entity_candidates(text) — disambiguation lookup
  • carabase_route_and_execute(query) — picks the right strategy based on the query shape
  • carabase_verify_hypothesis(claim) — semantic search + heuristic NLI to corroborate or contradict

Each result includes provenance + confidence so the agent can phrase its answer with appropriate hedging.
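The min_confidence + source_kinds pushdown from carabase_search_graph might compile to SQL along these lines. This is an illustrative query-builder sketch against the edges schema above; the server's actual SQL may differ:

```typescript
// Illustrative pushdown: translate min_confidence / source_kinds into a
// parameterized WHERE clause over the edges table ($1, $2, ... placeholders).
function edgeFilterSql(minConfidence?: number, sourceKinds?: string[]): { where: string; params: unknown[] } {
  const clauses: string[] = [];
  const params: unknown[] = [];
  if (minConfidence !== undefined) {
    params.push(minConfidence);
    clauses.push(`confidence >= $${params.length}`);
  }
  if (sourceKinds !== undefined && sourceKinds.length > 0) {
    params.push(sourceKinds);
    clauses.push(`source_kind = ANY($${params.length})`);
  }
  return { where: clauses.length > 0 ? clauses.join(" AND ") : "TRUE", params };
}
```

Parameterizing (rather than interpolating) keeps the filter values out of the SQL string, which matters since source_kinds arrives from tool input.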

verifyHypothesis is a deterministic heuristic (no LLM call required) for resolving factual uncertainty before the agent commits to an answer. Given a natural-language claim, it:

  1. Runs a semantic search to find supporting passages
  2. Tokenizes the claim (minus stopwords)
  3. Measures content-token overlap with each hit
  4. Flags a hit as contradictory when any sentence containing a claim-token also contains a negation cue (not, no, never, denies, refuted, false, didn't, isn't, wasn't, won't, without, …)
  5. Returns { verdict: 'corroborated' | 'contradicted' | 'mixed' | 'inconclusive', corroborated_by, contradicted_by, considered }

The agent calls this before answering questions of the form “did X happen?” — letting it correct itself instead of confidently stating something the corpus disagrees with.
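The whole heuristic is small enough to sketch end to end. The stopword list, overlap threshold, and cue set below are illustrative choices, not the shipped values:

```typescript
// Deterministic verify-hypothesis sketch: content-token overlap + negation cues.
const STOPWORDS = new Set(["the", "a", "an", "is", "was", "did", "do", "to", "of", "in", "on", "at", "and", "for"]);
const NEGATION_CUES = new Set(["not", "no", "never", "denies", "refuted", "false", "didn't", "isn't", "wasn't", "won't", "without"]);

// Lowercase, keep letter/apostrophe runs, drop stopwords.
function contentTokens(text: string): string[] {
  return (text.toLowerCase().match(/[a-z']+/g) ?? []).filter((t) => !STOPWORDS.has(t));
}

type Verdict = "corroborated" | "contradicted" | "mixed" | "inconclusive";

function verifyHypothesisSketch(
  claim: string,
  passages: string[], // stand-ins for the semantic-search hits from step 1
  minOverlap = 2,     // content-token overlap needed to count a hit at all
): { verdict: Verdict; corroborated_by: string[]; contradicted_by: string[]; considered: number } {
  const claimTokens = new Set(contentTokens(claim));
  const corroborated_by: string[] = [];
  const contradicted_by: string[] = [];
  for (const p of passages) {
    const overlap = contentTokens(p).filter((t) => claimTokens.has(t));
    if (overlap.length < minOverlap) continue; // passage isn't about this claim
    // Contradictory iff some sentence shares a claim token AND contains a negation cue.
    const contradicts = p.split(/[.!?]+/).some((sentence) => {
      const st = contentTokens(sentence);
      return st.some((t) => claimTokens.has(t)) && st.some((t) => NEGATION_CUES.has(t));
    });
    (contradicts ? contradicted_by : corroborated_by).push(p);
  }
  const verdict: Verdict =
    corroborated_by.length && contradicted_by.length ? "mixed"
      : corroborated_by.length ? "corroborated"
        : contradicted_by.length ? "contradicted"
          : "inconclusive";
  return { verdict, corroborated_by, contradicted_by, considered: passages.length };
}
```

The overlap gate runs before the negation check, so an unrelated passage that happens to contain "never" can't flip the verdict.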