Strategies - Fracta

The Reading Garden ships three strategies, each owning one stage of the CODE flow (Capture+Organize, Distill, Express). Since v0.5.2 they live under strategies/knowledge-garden/ (promoted out of _example/ so a fresh deploy discovers them automatically) and follow the standard fracta strategy shape: contract.yaml + strategy.py + binding.yaml. Strategies are deterministic Python DAGs run by the fracta strategy runner — not LLM calls. The pipeline prompt on First Run is the only place an LLM holds the framing; each strategy_run is a Python pipeline.

What ships

Strategy	Category	Purpose	Source
`highlight-distill`	enrichment	Pull Readwise highlights and documents, extract Concept and Entity candidates via three NLP MCPs, write to graph.	strategies/knowledge-garden/enrichment/highlight_distill
`cross-source-concepts`	correlation	Rescore every Concept by graph-wide signal; update `Concept.confidence`, `Concept.mention_count`, and `MENTIONS.weight`.	strategies/knowledge-garden/correlation/cross_source_concepts
`notion-publish`	traversal	Render a Concept (or top-N batch) as a Notion page; create-or-update idempotently via `Publication.content_hash`.	strategies/knowledge-garden/traversal/notion_publish

The three are designed to run as a pipeline (in order), but each is independently invocable — useful for iterating on one stage without re-pulling raw data.

highlight-distill

Purpose. Pull recent Readwise highlights and documents through the gateway, group them by source document, fan out three parallel NLP extractor calls per highlight, merge the results with an asymmetric scoring algorithm, and write Highlight, Document, Concept, and Entity nodes (plus MENTIONS / PART_OF / CAPTURED_FROM edges) into the graph. This is the Capture + Organize stage. It is the only strategy in the pattern that calls upstream MCPs for data ingest.

Inputs

Param	Type	Required	Default	Description
`watermark_iso`	string	no	`"1970-01-01T00:00:00Z"`	Pull only highlights with `updated > watermark`. The contract default is the documented “backfill from the beginning” timestamp; pass an explicit recent ISO (e.g. `2026-04-01T00:00:00Z`) on incremental runs to keep the Readwise call count bounded. The strategy also accepts a rolling sentinel `-7d` (resolved at runtime for DuckDB filtering); see `contract.yaml` for the current behaviour.
`page_size`	int	no	`100`	Readwise `list_highlights` page size. The hosted MCP caps at 20 req/min, so larger pages mean fewer round-trips.
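
A minimal sketch of how a rolling sentinel such as `-7d` could be resolved to an ISO watermark. The helper name `resolve_watermark` and the general `-<N>d` grammar are assumptions made for illustration; `contract.yaml` remains the reference for what the strategy actually accepts.

```python
import re
from datetime import datetime, timedelta, timezone


def resolve_watermark(value: str, now: datetime) -> str:
    """Turn a rolling sentinel like '-7d' into an ISO timestamp.

    Explicit ISO strings are passed through unchanged. The '-<N>d' grammar
    is an assumption; only '-7d' is mentioned in the documentation.
    """
    match = re.fullmatch(r"-(\d+)d", value)
    if match:
        resolved = now - timedelta(days=int(match.group(1)))
        return resolved.strftime("%Y-%m-%dT%H:%M:%SZ")
    return value


now = datetime(2026, 4, 8, tzinfo=timezone.utc)
rolling = resolve_watermark("-7d", now)
explicit = resolve_watermark("2026-04-01T00:00:00Z", now)
```

Passing `now` in explicitly keeps the helper deterministic, which matches the way strategies are described as deterministic pipelines.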

Outputs

Graph mutations only — no tabular return payload. After a run, expect:

One Highlight node per ingested Readwise highlight.
One Document node per source book / article / podcast.
Concept nodes for high-signal keyphrases (asymmetric score >= 0.40).
Entity nodes for gliner-typed spans (Person / Organisation / Place / Work / Product).
MENTIONS edges with weight, extracted_by (pipe-joined extractor IDs), extraction_score, and agreement_n (1, 2, or 3).
CAPTURED_FROM edges to the Readwise Highlights DomainSource.

Borderline candidates (score in [0.30, 0.40)) land in a DuckDB pending_extractions table — not graph — and graduate on second appearance.
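
The thresholds above can be read as a small routing rule. The sketch below is illustrative only: the function name `route_candidate` and the `seen_before` flag are hypothetical, and the behaviour below 0.30 (dropping the candidate) is an assumption, since only the two upper bands are described.

```python
GRAPH_THRESHOLD = 0.40    # Concepts are written at score >= 0.40
PENDING_THRESHOLD = 0.30  # [0.30, 0.40) is the borderline band


def route_candidate(score: float, seen_before: bool) -> str:
    """Return 'graph', 'pending', or 'drop' for one scored candidate."""
    if score >= GRAPH_THRESHOLD:
        return "graph"
    if score >= PENDING_THRESHOLD:
        # Borderline candidates graduate on their second appearance;
        # a first appearance is parked in pending_extractions.
        return "graph" if seen_before else "pending"
    # Assumption: candidates below 0.30 are discarded.
    return "drop"
```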

Extractor fan-out

Each highlight triggers three parallel MCP calls via ThreadPoolExecutor(max_workers=8):

Extractor	Tool	Score semantics	Role
`concept-keybert`	`keybert_extract_tool`	Within-document MMR-reranked cosine — not a probability.	Concept candidates when GLiNER doesn’t fire; consumed as a rank tier (top-3 = 0.70, 4–10 = 0.50, > 10 = 0.30), never as raw cosine.
`concept-gliner`	`gliner_extract_tool`	DeBERTa-v3 sigmoid in `[0, 1]`; comparable across spans.	Authoritative typer and score-of-record where it fires. Routes spans to `Entity.kind` (Person/Org/Place/Work/Product) or `Concept` (Theory/Method/Concept/Tool).
`concept-spacy`	`spacy_extract_tool`	OntoNotes-5 NER + noun chunks — categorical, no score field.	Fallback typer when GLiNER doesn’t fire; noun chunks (filtered `root_pos != PRON`) are last-resort candidates; contributes to `agreement_n` but not to `extraction_score`.
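
A rough sketch of the fan-out for a single highlight, assuming a hypothetical `call_tool(extractor, tool, text)` callable that stands in for the gateway client. The tool names come from the table above; for simplicity the pool is created per call here, whereas the real strategy may well share one pool across highlights.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Extractor ID -> MCP tool name, as listed in the table above.
EXTRACTOR_TOOLS = {
    "concept-keybert": "keybert_extract_tool",
    "concept-gliner": "gliner_extract_tool",
    "concept-spacy": "spacy_extract_tool",
}


def fan_out(text: str, call_tool: Callable[[str, str, str], list]) -> dict[str, list]:
    """Run the three extractors concurrently for one highlight."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {
            extractor: pool.submit(call_tool, extractor, tool, text)
            for extractor, tool in EXTRACTOR_TOOLS.items()
        }
        # result() blocks until each call finishes and re-raises its errors.
        return {extractor: future.result() for extractor, future in futures.items()}


def fake_call_tool(extractor: str, tool: str, text: str) -> list:
    # Placeholder so the sketch runs without a gateway (hypothetical).
    return [f"{tool}:{text}"]


results = fan_out("Popper on falsifiability", fake_call_tool)
```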

The asymmetric score derivation (per merged candidate cluster):

base = gliner_sigmoid              if gliner fired
     = keybert_rank_tier            elif keybert fired
     = SPACY_NER_PRIOR (0.55)       elif spacy NER fired
     = SPACY_NP_PRIOR (0.30)        else (noun-chunk only)

extraction_score = clip(
    base
  + AGREEMENT_BONUS[n_extractors]   # {1: 0.0, 2: 0.15, 3: 0.30}
  + (0.05 if typed_labels_agree else 0.0)
  - (0.10 if weak_substring_join else 0.0)
, 0.0, 1.0)
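
The derivation above could be written in Python roughly as follows. The function signature is hypothetical, and two details are assumptions: KeyBERT ranks are treated as 1-based, and spaCy counts as one extractor toward `n_extractors` whether it contributed an NER span or a noun chunk.

```python
from typing import Optional

SPACY_NER_PRIOR = 0.55
SPACY_NP_PRIOR = 0.30
AGREEMENT_BONUS = {1: 0.0, 2: 0.15, 3: 0.30}


def keybert_rank_tier(rank: int) -> float:
    """Map a 1-based KeyBERT rank to its tier (top-3, 4-10, beyond 10)."""
    if rank <= 3:
        return 0.70
    if rank <= 10:
        return 0.50
    return 0.30


def extraction_score(
    gliner_sigmoid: Optional[float] = None,
    keybert_rank: Optional[int] = None,
    spacy_ner: bool = False,
    spacy_noun_chunk: bool = False,
    typed_labels_agree: bool = False,
    weak_substring_join: bool = False,
) -> float:
    # Base score: the first extractor in priority order that fired.
    if gliner_sigmoid is not None:
        base = gliner_sigmoid
    elif keybert_rank is not None:
        base = keybert_rank_tier(keybert_rank)
    elif spacy_ner:
        base = SPACY_NER_PRIOR
    else:
        base = SPACY_NP_PRIOR

    # Agreement: number of distinct extractors behind the cluster (at least 1).
    fired = [gliner_sigmoid is not None, keybert_rank is not None,
             spacy_ner or spacy_noun_chunk]
    n_extractors = max(sum(fired), 1)

    raw = (
        base
        + AGREEMENT_BONUS[n_extractors]
        + (0.05 if typed_labels_agree else 0.0)
        - (0.10 if weak_substring_join else 0.0)
    )
    return min(1.0, max(0.0, raw))  # clip to [0, 1]
```

For example, a span that GLiNER scores at 0.6 and KeyBERT ranks second gets 0.6 plus the two-extractor bonus of 0.15.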

After scoring, the merged candidates go through MMR diversification (lambda=0.7, K=8 per highlight), using Jaccard similarity over canonical tokens. KeyBERT does not expose per-keyphrase embeddings, so embedding cosine is unavailable in v1.
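
A minimal sketch of greedy MMR selection with Jaccard similarity, using the lambda and K values stated above. It uses the textbook MMR objective (relevance minus redundancy against already-selected items), and it assumes that "canonical tokens" means lowercased whitespace-split words; the strategy's own tokenisation may differ.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mmr_select(candidates: dict[str, float], lam: float = 0.7, k: int = 8) -> list[str]:
    """Greedily pick up to k phrases, trading score against redundancy.

    candidates maps a canonical phrase to its extraction_score.
    """
    # Assumption: canonical tokens are lowercased whitespace-split words.
    tokens = {name: set(name.lower().split()) for name in candidates}
    remaining = dict(candidates)
    selected: list[str] = []

    while remaining and len(selected) < k:
        def mmr(name: str) -> float:
            redundancy = max(
                (jaccard(tokens[name], tokens[s]) for s in selected),
                default=0.0,
            )
            return lam * remaining[name] - (1 - lam) * redundancy

        best = max(remaining, key=mmr)
        selected.append(best)
        del remaining[best]
    return selected


picked = mmr_select(
    {"scientific method": 0.9, "the scientific method": 0.85, "falsifiability": 0.6},
    k=2,
)
```

In this example the near-duplicate phrase loses its second slot to the lower-scored but distinct one, which is the behaviour diversification is meant to produce.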

Call it

fracta spawn \
  --task ingest-recent \
  --contract "Call strategy_run(name='highlight-distill', params={'watermark_iso': '2025-10-01T00:00:00Z', 'page_size': 100}) and report counts of Highlight, Document, and Concept nodes written."

cross-source-concepts

Purpose. Read the graph; rescore every Concept using recency × frequency × source diversity, folded with the mean of MENTIONS.extraction_score; write back Concept.confidence, Concept.mention_count, and MENTIONS.weight. Surface high-confidence Concepts not yet linked to a Topic via the high_confidence_concept_without_topic checkpoint rule. This is the Distill stage. It is a pure-DuckDB-over-graph computation — no MCP fetch.

Inputs

This strategy takes no params in v1.

Outputs

Graph mutations only:

Concept.confidence — updated for every Concept the rescore touched.
Concept.mention_count — refreshed from the actual inbound MENTIONS count.
MENTIONS.weight — derived from extraction_score × recency_decay.

Scoring formula

confidence = sigmoid(
    w_freq      * log1p(mention_count)
  + w_recency   * recency_decay(last_seen_at)
  + w_diversity * domain_source_count           # via CAPTURED_FROM -> DomainSource
  + w_extract   * mean(MENTIONS.extraction_score)
)

Default weights: {freq: 0.30, recency: 0.15, diversity: 0.40, extract: 0.15}. The w_extract term folds highlight-distill’s per-extraction signal into the graph-aware confidence without cross-source-concepts ever writing extraction_score itself, so the two strategies never overwrite each other’s fields.
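
The formula and default weights can be sketched in Python as below. The shape of `recency_decay` is not specified in this page, so the exponential half-life (and the 90-day value) is purely an assumption for illustration; the function names are likewise hypothetical.

```python
import math
from datetime import datetime, timedelta, timezone

WEIGHTS = {"freq": 0.30, "recency": 0.15, "diversity": 0.40, "extract": 0.15}
HALF_LIFE_DAYS = 90.0  # assumption: the real decay curve is not documented here


def recency_decay(last_seen_at: datetime, now: datetime) -> float:
    """Exponential decay in (0, 1]; 1.0 means seen just now."""
    age_days = max((now - last_seen_at).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / HALF_LIFE_DAYS)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def concept_confidence(mention_count, last_seen_at, domain_source_count,
                       extraction_scores, now) -> float:
    """Weighted sum of the four signals, squashed through a sigmoid."""
    mean_extract = (sum(extraction_scores) / len(extraction_scores)
                    if extraction_scores else 0.0)
    z = (
        WEIGHTS["freq"] * math.log1p(mention_count)
        + WEIGHTS["recency"] * recency_decay(last_seen_at, now)
        + WEIGHTS["diversity"] * domain_source_count
        + WEIGHTS["extract"] * mean_extract
    )
    return sigmoid(z)


def mention_weight(extraction_score: float, last_seen_at: datetime, now: datetime) -> float:
    """MENTIONS.weight = extraction_score x recency_decay."""
    return extraction_score * recency_decay(last_seen_at, now)


now = datetime(2026, 1, 1, tzinfo=timezone.utc)
one_source = concept_confidence(3, now, 1, [0.8, 0.7], now)
two_sources = concept_confidence(3, now, 2, [0.8, 0.7], now)
```

With diversity carrying the largest weight, the same Concept seen in two DomainSources scores higher than when it is seen in one, all else equal.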

Call it

fracta spawn \
  --task rescore \
  --contract "Call strategy_run(name='cross-source-concepts') and report the top 10 concepts by confidence with their mention_count."

notion-publish

Purpose. Mirror the knowledge-garden into Notion as a three-database structure — Sources, Highlights, and Concepts — linked via RELATION columns so a reader can navigate from a published Concept back into the highlights and books that produced it. Idempotent across all three sinks via Publication.content_hash keyed by sink + external_id. Computes Concept.epistemic_status from Concept.confidence and writes it back to the graph before rendering. This is the Express stage. Since v0.5.2 the published artefact is the three-DB mirror, not a flat dump of Concept pages.

Inputs

Param	Type	Required	Default	Description
`concept_name`	string	no	—	Publish a single Concept by canonical name. Omit for batch mode.
`notion_concepts_database_id`	string	yes	—	`data_source_id` of the Concepts DB. (Legacy: `notion_database_id` is still accepted.)
`notion_highlights_database_id`	string	yes	—	`data_source_id` of the Highlights DB.
`notion_sources_database_id`	string	yes	—	`data_source_id` of the Sources DB.
`batch_size`	int	no	`10`	In batch mode, publish the top-N Concepts by confidence.

Outputs

For each Concept published:

A new or updated Notion page in the Concepts database, with a highlights RELATION column pointing at one or more rows in the Highlights database.
One Notion page in the Highlights database per supporting highlight (de-duped across concepts), with a source RELATION column pointing at the Sources database row for its Readwise book.
One Notion page in the Sources database per distinct Readwise book/article/podcast referenced.
A Publication node MERGEd per page with sink: notion:source | notion:highlight | notion:concept, external_id, and content_hash.
A PUBLISHED_AS edge wiring the source graph node to its Publication.

Pipeline

The strategy is a 7-step DAG:

load_target_concepts — top-N Concepts by confidence (or a single concept by concept_name).
load_supporting_highlights — Highlights mentioning each target Concept, with their denormalised book_* properties.
load_sources — distinct Documents reachable from the loaded Highlights.
publish_sources — idempotent per readwise_book_id; sink notion:source.
publish_highlights — idempotent per readwise_highlight_id; sink notion:highlight; carries a source RELATION to its Source page.
render_concepts — produces the Markdown body + properties dict per Concept.
publish_concepts — idempotent per Concept; sink notion:concept; carries the highlights RELATION assembled from step 5’s id_to_url map.
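
To illustrate how the seven steps could be ordered as a DAG, the sketch below uses the standard-library `graphlib`. The step names are from the list above, but the dependency edges are inferred from the descriptions and should be treated as assumptions, not as the strategy's actual wiring.

```python
from graphlib import TopologicalSorter

# step -> set of steps it depends on (edges are assumed, not documented)
DEPENDENCIES = {
    "load_target_concepts": set(),
    "load_supporting_highlights": {"load_target_concepts"},
    "load_sources": {"load_supporting_highlights"},
    "publish_sources": {"load_sources"},
    "publish_highlights": {"load_supporting_highlights", "publish_sources"},
    "render_concepts": {"load_target_concepts", "load_supporting_highlights"},
    "publish_concepts": {"render_concepts", "publish_highlights"},
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(DEPENDENCIES).static_order())
```

Sources are published before Highlights, and Highlights before Concepts, because each later page carries a RELATION to pages created in the earlier step.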

Notion MCP tool surface (v0.5.2)

The strategy calls four hosted Notion MCP tools:

notion-search + notion-fetch — locate-or-adopt path. After a notion-search hit, the strategy calls notion-fetch and requires an exact match on properties.concept_name (or equivalent) before adopting. Avoids the v1 cross-overwrite trap.
notion-create-pages — creates new pages. The payload shape is {parent: {data_source_id: ...}, pages: [{properties: {...}, content: "markdown..."}]}. Properties are flat scalars (text, number, date), not the v1 wrapped objects. content is enhanced-Markdown, not block JSON.
notion-update-page — updates properties and/or content. Use command: replace_content for a content-hash-driven full refresh; command: insert_content for appends.
RELATION properties are sent as JSON-stringified arrays of page IDs, not native arrays.
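
A sketch of assembling a `notion-create-pages` payload in the shape described above. The helper name and the `confidence` property are hypothetical; `concept_name` and the `highlights` RELATION column are taken from this page. Note the relation value is a JSON string, not a list.

```python
import json


def build_create_pages_payload(data_source_id: str, concept_name: str,
                               confidence: float, markdown_body: str,
                               highlight_page_ids: list) -> dict:
    """Build one-page payload: flat scalar properties plus Markdown content."""
    return {
        "parent": {"data_source_id": data_source_id},
        "pages": [
            {
                "properties": {
                    "concept_name": concept_name,
                    "confidence": confidence,  # hypothetical property name
                    # RELATION values are JSON-stringified arrays of page IDs.
                    "highlights": json.dumps(highlight_page_ids),
                },
                "content": markdown_body,
            }
        ],
    }


payload = build_create_pages_payload(
    "CONCEPTS_DS", "falsifiability", 0.82, "# Falsifiability\n", ["h1", "h2"]
)
```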

Idempotency

The Publication node is the source of truth — local lookup first, then API. The publish sequence per page (Source, Highlight, or Concept):

MATCH (p:Publication {sink: <sink>, external_id: <book_id|highlight_id|concept_name>}).
If found and p.content_hash == new_hash -> skip; no API calls.
If found and hash differs -> notion-update-page with command: replace_content.
If not found -> notion-search + notion-fetch strict-match; on hit, adopt; otherwise notion-create-pages.
Update Publication.content_hash, last_updated_at. Wire PUBLISHED_AS.
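
The decision logic in the sequence above can be sketched as a pure function. The names `plan_publish` and `content_hash` are hypothetical, and SHA-256 over the rendered body is an assumption; this page does not say how `Publication.content_hash` is computed.

```python
import hashlib
from typing import Optional


def content_hash(body: str) -> str:
    # Assumption: a SHA-256 hex digest of the rendered body.
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def plan_publish(stored_hash: Optional[str], new_hash: str,
                 remote_match: bool = False) -> str:
    """Decide the action for one page.

    stored_hash  -- content_hash on the local Publication node, or None
                    when no Publication exists for (sink, external_id).
    remote_match -- whether notion-search + notion-fetch found a strict match.
    """
    if stored_hash is not None:
        # Local lookup wins: skip when unchanged, otherwise replace content.
        return "skip" if stored_hash == new_hash else "update"
    # No local record: adopt a strictly matching remote page, else create.
    return "adopt" if remote_match else "create"


h = content_hash("# Falsifiability\n")
```

Keeping the decision separate from the API calls makes the "skip" path easy to verify: an unchanged hash never reaches Notion.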

The epistemic_status mapping at publish time:

confidence < 0.4 -> seedling
0.4 <= confidence <= 0.8 -> budding
confidence > 0.8 -> evergreen
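
As a small illustration, the mapping can be expressed as a function (the name `epistemic_status` mirrors the field, but the function itself is a sketch, not the strategy's code). Both boundary values, 0.4 and 0.8, fall in the budding band.

```python
def epistemic_status(confidence: float) -> str:
    """Map Concept.confidence to its garden stage."""
    if confidence < 0.4:
        return "seedling"
    if confidence <= 0.8:
        return "budding"
    return "evergreen"
```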

Call it

fracta spawn \
  --task publish-one \
  --contract "Call strategy_run(name='notion-publish', params={'concept_name': 'falsifiability', 'notion_concepts_database_id': 'CONCEPTS_DS', 'notion_highlights_database_id': 'HIGHLIGHTS_DS', 'notion_sources_database_id': 'SOURCES_DS'}) and report the URLs of every created or updated page."

Why three strategies, not one

Splitting capture, distill, and express into three strategies lets you run each independently: re-publish without re-ingesting, re-score without re-extracting, and iterate on one stage at a time. It also enforces the ownership seam: highlight-distill owns extraction_score; cross-source-concepts owns confidence; notion-publish owns epistemic_status and Publication.*. No two strategies write the same field, which keeps the checkpoint rules meaningful and the graph honest.
