docs: bring all documentation current with shipped work
A multi-agent audit of every doc against the code surfaced ~50 stale/missing
items (the roadmap/status docs and the backlog had fallen behind the code).
This catches them up:
- CLAUDE.md: phase status was ~3 phases stale ("Phase 1 is next" while Phase 1 +
chunks of 2 & 4 shipped). Rewrote the status list; added a model-provider
tech-stack entry; updated repo-layout (integrations objectstore/models,
deploy backup.sh/dev compose).
- ARCHITECTURE.md: §6 privacy engine described 3 visibility levels — corrected to
the shipped 4 (adds site_members); documented per-tree AI policy on Tree,
LLMProvider/EmbeddingProvider split + registry, ChangeProposal origin/status/
operations, verified-email session gate, instance-owner role, schema-drift
guard, and the env_file config model.
- PRD.md: 4-level visibility in US-040/§5.5, instance-owner role (§5.1/§5.11),
per-tree AI policy (§5.8), §8 sequencing annotated with shipped status, header
date/status bumped.
- README.md: 4-level privacy; softened "Full GEDCOM 7" to the 5.5.1/7 common
subset; noted backups + instance-owner admin; moved property/land to an
explicit "where it's headed" (no property models exist yet).
- BACKLOG.md: flipped ~15 shipped-but-open rows to Have (ChangeProposal, provider
abstraction, GEDCOM citation export, membership management, operator backup,
email-verification gate, per-tree AI policy, instance owner, the whole
visibility/public-viewing/child-resource-redaction cluster #41-#51/#46), and
reconciled the executive summary, "current defects" list, quick wins, and
differentiators. Left genuinely-open items (citation/source redaction, sitemap,
per-tree noindex, scoped-token API) accurately open.
- .env.example: dropped "SMTP wired in a later phase"; documented the worker
purge knobs, S3_PRESIGN_TTL, COOKIE_NAME; removed a stray duplicate line.
- design/: tree-visibility.md and change-proposal.md marked Shipped; corrected
the redaction approach (reuses member schemas, not a separate PublicPersonRead)
and the apply() rollback claim (v1 is not cross-op transactional), and marked
rate-limiting/sitemap/noindex as deferred.
No code changes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Justin Paul <justin@jpaul.me>
+36
-39
@@ -16,19 +16,18 @@
**Where Provenance is strong today.** The foundation is genuinely solid and, in several places, ahead of the OSS field:
- **Sources-first spine is real.** A reusable `Source` + per-fact `Citation` two-tier model with an `exactly_one_target` CHECK constraint, confidence enum, and full backend CRUD. This is the architectural thing webtrees/Gramps get right and most commercial tools bury. (Caveat: citations are silently dropped on GEDCOM *export* — see below.)
- **Privacy architecture is the right shape.** A single `privacy.py` engine, `TenantScoped` mixin on every row, living-person heuristic (`is_possibly_living`, unknown-birth-treated-as-living), and media served **through the backend rather than via raw S3 URLs**. The *shape* is correct; coverage is not yet complete (the media endpoint and several child resources don't yet apply `person_visibility` — see §2.4, §2.10).
- **Sources-first spine is real.** A reusable `Source` + per-fact `Citation` two-tier model with an `exactly_one_target` CHECK constraint, confidence enum, and full backend CRUD. This is the architectural thing webtrees/Gramps get right and most commercial tools bury.
- **Privacy architecture is the right shape — and coverage is now broad.** A single `privacy.py` engine, `TenantScoped` mixin on every row, living-person heuristic (`is_possibly_living`, unknown-birth-treated-as-living), and media served **through the backend rather than via raw S3 URLs**. Non-member reads of persons, events, media, names, and relationships all route through `person_visibility` (#46). The remaining gap is the `citation`/`source` list endpoints, which still gate only on `can_view_tree` — see §2.10.
- **Non-destructive by design.** Soft-delete with timed purge worker, immutable `AuditEntry` (before/after JSONB, `actor_type` ready for the assistant), GEDCOM merge that copies rather than overwrites, full account export/import.
- **Modeling maturity.** Typed parent/child qualifiers (biological/adoptive/step/foster/donor/guardian), typed alternate names with one-primary invariant, dual verbatim+normalized dates, duplicate-relationship guards, UUID surrogate keys.
- **Standards core.** GEDCOM 5.5.1 import/export is **functional** (with preview/merge-vs-create resolution UI), pg_trgm fuzzy name search, multi-tenant tree hosting with visibility tiers. Round-trip *fidelity* has four tracked gaps (citation links, custom tags, PLAC coords/hierarchy, non-UTF-8 encoding) — see §2.11.
- **Standards core.** GEDCOM 5.5.1 import/export is **functional** (with preview/merge-vs-create resolution UI), pg_trgm fuzzy name search, multi-tenant tree hosting with visibility tiers. Round-trip *fidelity* has three tracked gaps (custom tags, PLAC coords/hierarchy, non-UTF-8 encoding) — see §2.11.
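
The `exactly_one_target` CHECK named in the first bullet can be illustrated with a minimal sketch. It uses SQLite from the Python standard library rather than the project's Postgres schema, and the target columns (`person_id`, `event_id`, `relationship_id`) are hypothetical stand-ins; only the constraint name comes from this document.

```python
import sqlite3

# Hypothetical schema: the column names are illustrative, not Provenance's own.
SCHEMA = """
CREATE TABLE citation (
    id INTEGER PRIMARY KEY,
    source_id INTEGER NOT NULL,
    person_id INTEGER,
    event_id INTEGER,
    relationship_id INTEGER,
    CONSTRAINT exactly_one_target CHECK (
        (person_id IS NOT NULL)
        + (event_id IS NOT NULL)
        + (relationship_id IS NOT NULL) = 1
    )
)
"""


def insert_citation(conn, source_id, person_id=None, event_id=None, relationship_id=None):
    """Return True if the row is accepted, False if the CHECK rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO citation (source_id, person_id, event_id, relationship_id) "
                "VALUES (?, ?, ?, ?)",
                (source_id, person_id, event_id, relationship_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

one_target = insert_citation(conn, 1, event_id=7)                 # accepted
no_target = insert_citation(conn, 1)                              # rejected: nothing cited
two_targets = insert_citation(conn, 1, person_id=3, event_id=7)   # rejected: ambiguous
row_count = conn.execute("SELECT COUNT(*) FROM citation").fetchone()[0]
```

Putting the rule in the database rather than only in service code means every write path is covered, including bulk imports.
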
**Documentation-vs-code gaps to correct now (per "docs travel with code").** Three repo claims are not yet true and should be edited in the same spirit they were written:
**Documentation-vs-code gaps to correct now (per "docs travel with code").** Two repo claims are not yet true and should be edited in the same spirit they were written:
- **ChangeProposal is documented as landed but does not exist.** CLAUDE.md states the core data model (ARCHITECTURE §5) landed / "Phase 0 complete," but `ChangeProposal` — part of §5 and the load-bearing AI invariant — has no model, migration, or schema. Either scope it out of the "landed" claim or build it; don't leave the docs asserting it.
- **pgvector is claimed as used; it is not.** Only `pg_trgm` is created. ARCHITECTURE references pgvector for match ranking.
- **i18n "from day one" is documented but unmet.** PRD §6 promises externalized strings; every label is a hardcoded literal.
These three doc edits are themselves trivial quick wins (see §3).
These two doc edits are themselves trivial quick wins (see §3).
**The biggest gaps vs commercial (Ancestry / MyHeritage / FamilySearch).** Provenance is not trying to be a record provider, and correctly so — but it is missing several things mainstream users treat as table stakes:
@@ -40,19 +39,17 @@ These three doc edits are themselves trivial quick wins (see §3).
**The biggest gaps vs OSS (GRAMPS / Gramps Web / webtrees).** These are where a privacy-first self-host product is expected to compete and currently trails:
- **Collaboration is plumbed but unreachable.** `TreeMembership` roles are enforced on every read/write, but there is **no API or UI to invite, grant, change, or revoke** a member — the tree is effectively single-user despite multi-user infrastructure. This also breaks the full-CRUD invariant (NN#8) and, because importance and the old Phase-6 schedule disagree, a minimal management slice is pulled forward (§2.9).
- **Living-person redaction is non-uniform.** Redaction is applied on person reads but **not** on the event/media/name/relationship/citation/source child-resource endpoints — a real PII leak on public/unlisted trees (NN#3, NN#2).
- **`site_members` visibility tier is silently broken** (defined, selectable in UI, never handled in `can_view_tree`).
- **Collaboration management is now reachable, but minimal.** `TreeMembership` roles are enforced on every read/write, and a list/add/change-role/remove API + UI now ship (§2.9), satisfying the full-CRUD invariant (NN#8). The remaining gap is the richer **email invite/grant flow** (pending-invite state, resend/expire), still scheduled for Phase 6.
- **Living-person redaction is now near-uniform.** Non-member reads of persons, events, media, names, and relationships all redact possibly-living people (#46); the `citation`/`source` list endpoints are the remaining hold-outs (they gate only on `can_view_tree`) — a narrowed PII gap on public/unlisted trees (NN#3, NN#2).
- **No place as a usable first-class entity** (model exists, created by GEDCOM, but no read/edit/delete — a create-only entity, which is a bug per NN#8).
- **No research log, to-do/task planner, kinship calculator, data-quality checker, or i18n/string externalization** (the last is a documented day-one commitment that is currently unmet).
**Security-priority correctness fixes (do these first, regardless of phase).** Three current defects are user-harm or trust issues, not roadmap items:
**Security-priority correctness fixes (do these first, regardless of phase).** Most of the original redaction defects shipped this cycle (#46); two items remain — one a narrowed PII gap, one a config switch:
1. **Media privacy leak (§2.4)** — `list_media`/`get_media`/`media_content` gate on `can_view_tree` but never `person_visibility`; non-owners can download photos of redacted living people on public/unlisted trees.
2. **Child-resource redaction gap (§2.10)** — event/media/name/relationship/citation/source endpoints don't apply living-person redaction.
3. **Registration issues a live session before verification (§2.10)** — `register` returns an authenticated session cookie + token (201) and `email_verified_at` is written but never read on any path; there is no env switch to gate self-registration. The *enforcement check* (read-side `email_verified_at`) is small; the approval-mode env switch is the larger piece.
1. **Citation/source redaction gap (§2.10)** — `list_media`/`get_media`/`media_content`, plus the event/name/relationship endpoints, now apply `person_visibility` for non-members (#46), closing the media leak. The `citation`/`source` list endpoints still gate only on `can_view_tree`, so a non-member on a public/unlisted tree can still enumerate citations/sources tied to redacted living people — the remaining living-person leak.
2. **Self-registration approval-mode switch (§2.10)** — the read-side enforcement now exists: `REQUIRE_EMAIL_VERIFICATION` gates login/session on `email_verified_at` (#53). The remaining gap is the env switch to choose open vs admin-approval vs closed self-registration.
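
The read-side verification gate from item 2 can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `auth_service.py`: the `User` dataclass, the function names, and the set of accepted truthy strings are hypothetical. Only `REQUIRE_EMAIL_VERIFICATION`, `email_verified_at`, and the opt-in default come from this document.

```python
import os
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Mapping, Optional


@dataclass
class User:
    """Hypothetical stand-in for the real user model."""
    email: str
    email_verified_at: Optional[datetime] = None


def verification_required(env: Mapping[str, str]) -> bool:
    # Opt-in: an unset variable means the gate is off, so SMTP-less
    # self-hosts keep working. The truthy spellings are an assumption.
    raw = env.get("REQUIRE_EMAIL_VERIFICATION", "false")
    return raw.strip().lower() in {"1", "true", "yes"}


def may_open_session(user: User, env: Mapping[str, str] = os.environ) -> bool:
    """Decide whether login may issue a session for this user."""
    if not verification_required(env):
        return True
    return user.email_verified_at is not None


unverified = User("a@example.org")
verified = User("b@example.org", datetime(2024, 1, 1, tzinfo=timezone.utc))

gate_off = may_open_session(unverified, {})
gate_on_unverified = may_open_session(unverified, {"REQUIRE_EMAIL_VERIFICATION": "true"})
gate_on_verified = may_open_session(verified, {"REQUIRE_EMAIL_VERIFICATION": "true"})
```

Passing the environment in as a mapping keeps the check testable without mutating `os.environ`.
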
**Strategic posture.** The differentiators worth pressing — property chain-of-title, the ChangeProposal AI model, the anonymous mutual-consent hint system, and true self-host data ownership — are mostly still ahead on the roadmap. The near-term job is (a) close the **privacy/auth correctness** and **collaboration** gaps that the architecture already implies, (b) ship the **maps + reports + merge** table stakes, and (c) build the **connector/ModelProvider/ChangeProposal** spine that unlocks the entire back half of the roadmap.
**Strategic posture.** The differentiators worth pressing — property chain-of-title, the ChangeProposal AI model, the anonymous mutual-consent hint system, and true self-host data ownership — are mostly still ahead on the roadmap. The near-term job is (a) close the **privacy/auth correctness** and **collaboration** gaps that the architecture already implies, (b) ship the **maps + reports + merge** table stakes, and (c) finish the back-half spine — the **connector framework** plus wiring the now-landed **ChangeProposal/ModelProvider** into the assistant — that unlocks the entire back half of the roadmap.
---
@@ -129,11 +126,11 @@ Fuzzy trigram name search is **have**; everything that depends on connectors, em
### 2.4 Media & documents
Universal media attachment is **have**, but with a **confirmed privacy leak** and no asset-processing pipeline.
Universal media attachment is **have**; the earlier privacy leak is now **closed** (#46), and the remaining gaps are the asset-processing pipeline (EXIF strip, thumbnails).
| Item | Description | Status | Imp | Eff | Phase | Non-negotiable |
|---|---|---|---|---|---|---|
| **Media privacy gating on serve paths** | `list_media`/`get_media`/`media_content` gate only on `can_view_tree`, never `person_visibility` — a non-owner can download photos of redacted living people on public/unlisted trees. | Have(leaky) | **Critical** | M | 1 | **Security-priority — fix first. Direct NN#3/NN#2 violation.** Check attached `person_id` visibility and redact/hide. |
| **Media privacy gating on serve paths** | `list_media`/`get_media`/`media_content` now apply `person_visibility` for non-members (#46): media is exposed only when linked to a FULL-visibility person (`list_public_media`/`can_view_media`), so living-person photos no longer leak on public/unlisted trees. | Have | **Critical** | M | 1 | **Resolved (NN#3/NN#2).** Serve paths check attached `person_id` visibility and 404 otherwise. |
| EXIF / GPS stripping on upload | Raw bytes stored verbatim; family photos leak GPS/home addresses/timestamps. | Planned | High | M | 1 | **Security-priority**, not cosmetic. Parse EXIF on ingest, strip/quarantine by default, allow override. |
| Thumbnail / preview generation | No image pipeline (no Pillow). Async, idempotent worker job. | Planned | High | L | 1 | Derived thumbnail must inherit parent privacy — no bypass path. |
| Image reference regions | Mark the rectangle of a census image that supports a Citation. | Missing | Med | M | later | Tenant-scoped, full CRUD; region→Citation preferred over region→Person. |
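
The media gating rule in the first row (expose media to non-members only when it is linked to a FULL-visibility person, and 404 otherwise) might look like the sketch below. The function names `can_view_media` and `list_public_media` appear in this document, but their real signatures are not shown here, so the parameters, the `Media` dataclass, the non-FULL enum members, and the treatment of unattached media are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, List, Optional


class Visibility(Enum):
    FULL = "full"
    REDACTED = "redacted"  # assumed name for a possibly-living person's view


@dataclass(frozen=True)
class Media:
    id: str
    person_id: Optional[str]


class NotFound(Exception):
    """Stand-in for an HTTP 404 response."""


def can_view_media(media: Media, is_member: bool,
                   visibility_of: Callable[[str], Visibility]) -> bool:
    if is_member:
        return True
    if media.person_id is None:
        # Assumption: media not attached to a person stays hidden from non-members.
        return False
    return visibility_of(media.person_id) is Visibility.FULL


def list_public_media(items: Iterable[Media],
                      visibility_of: Callable[[str], Visibility]) -> List[Media]:
    """Filter a listing down to what a non-member may see."""
    return [m for m in items if can_view_media(m, False, visibility_of)]


def serve_media(media: Media, is_member: bool,
                visibility_of: Callable[[str], Visibility]) -> Media:
    # Raise 404 rather than 403 so the response does not confirm the media exists.
    if not can_view_media(media, is_member, visibility_of):
        raise NotFound(media.id)
    return media


visibility = {"p-deceased": Visibility.FULL, "p-living": Visibility.REDACTED}
items = [Media("m1", "p-deceased"), Media("m2", "p-living"), Media("m3", None)]
public_ids = [m.id for m in list_public_media(items, visibility.__getitem__)]
```

Listing and serving share one predicate, so a photo cannot be hidden from the list yet still be downloadable by ID.
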
@@ -224,14 +221,14 @@ The preview→approve **bulk cleanup** tool is a genuine **have** and a differen
### 2.9 Collaboration & sharing
Authorization is enforced everywhere, but the **management surface is entirely absent** — the most consequential gap relative to the multi-user product promise. Because the Critical items below previously sat at Phase 6 while their labels said "breaks NN#8," a minimal management slice is pulled forward to Phase 2; the richer invite/email UX stays at Phase 6.
Authorization is enforced everywhere, and a **minimal management surface now ships** — list/add/change-role/remove via `api/v1/members.py` plus a members page (#233). The remaining gap is the richer email invite/grant flow. The minimal slice landed at Phase 2 as planned; the invite/email UX stays at Phase 6.
| Item | Description | Status | Imp | Eff | Phase | Non-negotiable |
|---|---|---|---|---|---|---|
| **Membership PATCH/DELETE + role change (minimal slice)** | Add/adjust/revoke a collaborator and change `role` — the substrate (mutable `role`) exists; only the endpoints are missing. Resolves the create-only NN#8 break without the full invite flow. | Partial | **Critical** | S–M | 2 | **Pulled forward** — a create-only entity shouldn't wait for Phase 6 (NN#8). Revocation routes through the single privacy point. |
| **Membership PATCH/DELETE + role change (minimal slice)** | Add/adjust/revoke a collaborator and change `role` — GET/PATCH/DELETE on `/trees/{id}/members` (`api/v1/members.py`) plus a frontend members page now ship (#233). Resolves the create-only NN#8 break without the full invite flow. | Have | **Critical** | S–M | 2 | Resolves the create-only NN#8 break. Revocation routes through the single privacy point. |
| Full invite/grant flow (email + UI) | Email-based invitations, pending-invite state, role-grant UI, resend/expire. Builds on the minimal slice. | Partial | High | L | 6 | Invitation email via configured SMTP (NN#7); membership changes through the one enforcement point. |
| **Read-only public tree share** | Visibility model already redacts living persons for anonymous viewers, but every endpoint requires `CurrentUser` — no optional-auth dep, no public route, no public page. | Partial | High | M | 2 | Highest-leverage near-term sharing feature; living-safe by construction via `person_visibility` (NN#2/#3). |
| SEO public profile pages (server-rendered) | Intent declared (`public` = search-indexable) but zero implementation; no sitemap/robots/meta. | Partial | Med | L | 2 | NN#2 explicitly names server-rendered public pages — must go through privacy engine, no direct row queries. |
| **Read-only public tree share** | Anonymous read surface shipped: optional-auth `CurrentUserOrNone` dep, `api/v1/public.py` + `public_view_service.py`, and server-rendered pages at `/p/[treeId]` (+ `/persons/[personId]`) and `/explore`. Living-safe by construction via `person_visibility`. | Have | High | M | 2 | Highest-leverage near-term sharing feature; living-safe by construction via `person_visibility` (NN#2/#3). |
| SEO public profile pages (server-rendered) | Server-rendered public pages (`/p/[treeId]`, `/explore`) and `robots.ts` now ship. Deferred follow-ups: a public-only `sitemap.ts` and per-tree `noindex,nofollow` meta for `unlisted`/`site_members` pages. | Partial | Med | L | 2 | NN#2 explicitly names server-rendered public pages — must go through privacy engine, no direct row queries. |
| **Notification / event-dispatch substrate** | Shared enabler seeded from `AuditEntry`: subscription + dispatch layer emitting privacy-filtered projections. Underpins watch/follow, mutual-consent match notices, comments, moderation, and in-app messaging. | Missing | High | L | 6 | **Privacy-filtered projections only — never raw before/after JSON** (NN#2/#3). |
| Comments / discussion threads | Per-profile discussion (target = person/event/source), threaded. | Missing | High | M | 6 | Comments on living persons redacted for non-members (NN#2/#3); rides the dispatch substrate. |
| In-app messaging (contact details hidden) | SMTP exists; no Message/Thread model. | Planned | High | L | 6 | Hide contact details; opens after mutual consent (NN#4); redact living-person content; rides dispatch substrate. |
@@ -253,10 +250,11 @@ The architecture is correct (single engine, tenant mixin, audit, soft-delete + p
| Item | Description | Status | Imp | Eff | Phase | Non-negotiable |
|---|---|---|---|---|---|---|
| **Uniform living-person redaction across child resources** | `_redact` runs on person reads but **not** on event/media/name/relationship/citation/source endpoints — non-members fetch a possibly-living person's events/photos/names directly. | Partial | **Critical** | M | 1–2 | **Security-priority. Core NN#3/NN#2 defect.** Apply `person_visibility` on every person-derived fact. |
| **Email-verification enforcement gate** | `email_verified_at` is written at `auth_service.py:154` but read on no path; `register` returns an authenticated session cookie + token (201) pre-verification. | Partial | **High** | S | 1–2 | **Security-priority near-quick-win** — add the read-side check (NN#7 trust path). The check is small; the registration-mode switch below is the larger piece. |
| **Uniform living-person redaction across child resources** | `person_visibility` now runs for non-members on the event, media, name, and relationship endpoints (#46), which delegate to `public_view_service`. Remaining: the `citation`/`source` list endpoints still gate only on `can_view_tree`, so citations tied to a redacted living person are still enumerable. | Partial | High | S | 1–2 | **Mostly resolved (NN#3/NN#2).** Apply `person_visibility` to the citation/source list paths to close the residual leak. |
| **Email-verification enforcement gate** | Read-side check now ships (#53): `REQUIRE_EMAIL_VERIFICATION` gates login/session on `email_verified_at` (`auth_service.py`). Opt-in (default off) so SMTP-less self-hosts still work. | Have | **High** | S | 1–2 | Read-side trust path now enforced (NN#7); the registration-mode switch below is the separate larger piece. |
| Self-registration mode gating (approve / open / closed) | No env switch to choose open vs admin-approval vs closed registration. | Partial | High | M | 2/5 | Twelve-factor registration control (NN#7); pairs with the verification gate above. |
| **Fix `site_members` visibility tier** | Defined + selectable in UI but `can_view_tree` only handles public/unlisted — fails closed unintuitively. | Partial | Critical | S | 1 | **Quick win.** Least-surprise; honor the tier the UI offers. |
| Instance owner / operator role | `OWNER_EMAIL`-declared operator (#240): `is_instance_owner` on `/users/me`, owner-only `GET /api/v1/admin/instance`, `/admin` UI. | Have | Med | S | 2/5 | Owner-only operational surface, twelve-factor via env (NN#7); reads stay through the service layer. |
| **Fix `site_members` visibility tier** | `can_view_tree` now handles `site_members` (`privacy.py:56`): any authenticated account gets a read view, anonymous is refused. | Have | Critical | S | 1 | Honors the tier the UI offers; reads still route through `person_visibility`. |
| Make `LIVING_RECENCY_YEARS` configurable | Hardcoded 100 at `privacy.py:23`. | Partial | High | S | 2 | **Quick win.** Twelve-factor (NN#7). |
| Privacy-stripped export (redact living) | GEDCOM + account export emit full tree; no "strip living" mode. | Missing | High | M | 2 | Reuse `person_visibility`/`_redact` (NN#3). Owner self-export is safe today; shareable variant is the gap. |
| Per-fact / per-field privacy + record flags | tentative/rejected/preferred/private flags on facts. | Missing | Med | L | later | If added, route through the single engine (NN#2). |
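
Two of the rows above describe small decision rules: the `site_members` tier in `can_view_tree`, and making `LIVING_RECENCY_YEARS` configurable for the living-person heuristic. The sketch below shows one plausible shape for both. It is not the code in `privacy.py`: the signatures are hypothetical, the `PRIVATE` tier name is an assumption (the document names only `public`, `unlisted`, and `site_members`), and treating a recorded death as "not living" is likewise assumed.

```python
import os
from enum import Enum
from typing import Mapping, Optional


class TreeVisibility(Enum):
    PRIVATE = "private"            # assumed name for the members-only tier
    SITE_MEMBERS = "site_members"
    UNLISTED = "unlisted"
    PUBLIC = "public"


def can_view_tree(visibility: TreeVisibility, is_authenticated: bool,
                  is_member: bool) -> bool:
    if is_member:
        return True
    if visibility in (TreeVisibility.PUBLIC, TreeVisibility.UNLISTED):
        return True
    if visibility is TreeVisibility.SITE_MEMBERS:
        # Any authenticated account gets a read view; anonymous is refused.
        return is_authenticated
    return False


def living_recency_years(env: Mapping[str, str] = os.environ) -> int:
    # Falls back to the previously hardcoded 100 when the variable is unset.
    return int(env.get("LIVING_RECENCY_YEARS", "100"))


def is_possibly_living(birth_year: Optional[int], death_year: Optional[int],
                       current_year: int, recency_years: int = 100) -> bool:
    if death_year is not None:
        return False  # assumption: a recorded death settles the question
    if birth_year is None:
        return True   # unknown birth is treated as living
    return current_year - birth_year < recency_years
```

A view permission from `can_view_tree` only admits the reader to the tree; per-person redaction still has to run afterwards.
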
@@ -270,11 +268,11 @@ The architecture is correct (single engine, tenant mixin, audit, soft-delete + p
### 2.11 Import/export & standards
GEDCOM 5.5.1 import/export and full data-portability export are **have**, but fidelity gaps directly undercut the provenance thesis — and one is outright data loss.
GEDCOM 5.5.1 import/export and full data-portability export are **have**; the remaining fidelity gaps (custom tags, PLAC coords/hierarchy, non-UTF-8 encoding) still undercut the provenance thesis.
| Item | Description | Status | Imp | Eff | Phase | Non-negotiable |
|---|---|---|---|---|---|---|
| **Citation links dropped on GEDCOM export** | Export never selects the Citation table — fact→source links, page, detail, confidence all dropped on export (they import fine). Re-importing your own export **destroys** the sources-first graph. | Partial | **Critical** | M | 2 | **Silent data loss on the product's signature data + destructive round-trip** (NN#5); breaks PRD US-013. |
| **Citation links on GEDCOM export** | Export now selects Citations and emits `SOUR`/`PAGE` per fact (#232), so fact→source links survive a Provenance→Provenance round-trip. (Citation detail and confidence, beyond `PAGE`, do not yet round-trip.) | Have | **Critical** | M | 2 | Closes the silent data-loss / destructive round-trip on the product's signature data (NN#5); satisfies PRD US-013. |
| GEDCOM 7.0 import/export | Version hardcoded `5.5.1`; no v7 semantics, SCHMA, SUBM, or UID handling. | Partial | High | L | 2 | Stated differentiator (FamilySearch interop). |
| Custom/underscore tag preservation | `_MARNM` becomes `TYPE married`, other custom tags dropped — violates ≥99% round-trip goal. | Missing | High | L | 2 | Tension with provenance thesis (faithful record). |
| PLAC FORM hierarchy + MAP coordinate round-trip | Import reads only PLAC text; export emits flat PLAC. lat/long + hierarchy lost on round-trip. | Missing | High | M | 2–3 | Round-trip fidelity for the land/maps pillar. |
@@ -309,7 +307,7 @@ Internal REST + OpenAPI + generated TS client are **have**. The externalized dev
| Item | Description | Status | Imp | Eff | Phase | Non-negotiable |
|---|---|---|---|---|---|---|
| Public read-only API + scoped tokens (OAuth) | Bearer token is opaque session only; `TokenPurpose` lacks scopes; designed `public.py` never built. | Partial | High | L | 5–6 | Any scoped-token path routes through `person_visibility` + living-person redaction (NN#2/#3). |
| Public read-only API + scoped tokens (OAuth) | The unauthenticated public read surface (`public.py`) now ships (#41–#51), but for a *developer* API the bearer token is still opaque session only and `TokenPurpose` lacks scopes — no scoped/OAuth token path. | Partial | High | L | 5–6 | Any scoped-token path routes through `person_visibility` + living-person redaction (NN#2/#3). |
| SourceConnector framework | Only AuthProvider/ObjectStore/Mailer base classes exist; no connector base/loader/registry. Gates AI, hints, property connectors. | Planned | Med | L | 4 | Read-only, rate-limited; findings via ChangeProposal (NN#1); legal sources only (NN#6). |
| Webhooks / change feeds | `AuditEntry` is the natural substrate (shares the notification dispatch layer, §2.9); no feed/webhook layer. | Missing | Med | L | 6 | Emit privacy-filtered, tenant-scoped projections — never raw before/after JSON (NN#2/#3). |
| CLI / scripting surface | No `[project.scripts]`, no Typer/Click; worker is a purge loop only. Self-hosters want bulk admin. | Missing | Med | M | 9 | Funnel reads through privacy.py, writes through audit; admin-scoped, no assistant-write path. |
@@ -328,7 +326,7 @@ Postgres + S3, multi-tenant isolation are **have**. Queue, observability, backup
| Real job queue (Postgres/Redis-backed) | Worker is a fixed-interval purge loop; GEDCOM import and account export run **inline in the request**. | Partial | High | L | 4 (pre-req) | Blocks NN#1 (assistant in worker) and NN#4 (hint matching in worker). Queue backend is an open question (PRD §11). |
| **Pagination on list endpoints + server-side tree loading** | List endpoints (`persons.py:37`, events, relationships) take no `limit/offset/skip`; the tree view loads the whole graph client-side. A *current* limitation against the 50k-person target. | Planned | High | M | 1–2 | **Split out from scale validation** — this is a correctness/functional gap now, not a Phase 9 task. |
| Scale validation (50k+ trees, P95<2s, load test) | No benchmark or load test exists. | Planned | High | L | 9 | Inline heavy ops risk partial writes — moving to the queue is what makes "failures never corrupt state" true. |
| **Operator backup: one-command `pg_dump` + MinIO sync** | Only a documented procedure + per-account ZIP exist; no scripted DB+object dump. For a self-host product this is day-one data-loss exposure. | Partial | Critical | M | 1–2 | **Pulled forward** — Critical importance contradicted the old Phase-9 slot. Restore must re-apply privacy state faithfully (NN#3); safety net for NN#8. |
| **Operator backup: one-command `pg_dump` + MinIO sync** | `deploy/backup.sh` + `deploy/BACKUP.md` now provide a scripted DB+object dump (#234). Remaining: scheduled/off-host/verified-restore tooling (row below). | Have | Critical | M | 1–2 | Restore must re-apply privacy state faithfully (NN#3); safety net for NN#8. |
| Scheduled / cloud automated backup + restore tooling | Cron-driven, off-host, verified-restore workflow. | Partial | High | L | 9 | Builds on the one-command slice above. |
| ARM64 build matrix | CI builds `linux/amd64` only; many self-hosters run ARM SBCs. | Partial | High | S | 1 | **Quick win.** Add arm64 + QEMU to buildx (NN#7 container-native). |
| Structured JSON logs + Prometheus metrics | Plain-text stdlib logging; no `/metrics`. | Partial | Med | M | 9 | Logs/metrics reference UUIDs, never names/PII (NN#3/#4). |
@@ -361,12 +359,13 @@ The entire "land" half is **planned/missing** but fully specified. This is where
### 2.16 AI assistant — *defining differentiator*
Entirely **planned** — and note the docs-vs-code gap: ARCHITECTURE §5 lists `ChangeProposal` as part of the "landed" core model, but no model/migration/schema exists. The audit substrate (`actor_type=assistant`, before/after JSONB) is the right foundation; the ChangeProposal model and ModelProvider abstraction are the two critical-path pieces.
The spine has now **landed**: the `ChangeProposal` model/schema/service, its migration, the GET/POST API, and a review UI all ship, and the `LLMProvider`/`EmbeddingProvider` abstraction with null/Anthropic/OpenAI-compat (OpenAI/xAI/Ollama) providers + registry is in place. The audit substrate (`actor_type=assistant`, before/after JSONB) is the right foundation; the remaining work is wiring the assistant's tools to emit proposals and building the chatbot/RAG surface on top.
| Item | Description | Status | Imp | Eff | Phase | Non-negotiable |
|---|---|---|---|---|---|---|
| **ChangeProposal (propose-then-confirm)** | The defining invariant. No `proposal.py`, no migration, no review UI yet — despite docs implying it landed. | Planned | **Critical** | L | 4 | **IS NN#1.** Enforce structurally: assistant tools return proposals; only user action applies one; application flows through the normal service layer (privacy + audit). ChangeProposal itself needs full CRUD (NN#8). Correct the docs to match reality. |
| Pluggable LLM + embedding provider | `ModelProvider` over Anthropic/OpenAI/xAI/Ollama; env placeholders exist, no interface code. | Planned | Critical | M | 4 | **Twelve-factor, no hard-coded keys/endpoints** (NN#7); the Ollama/self-hosted path is what makes the privacy-first promise real. |
| **ChangeProposal (propose-then-confirm)** | The defining invariant. Model/schema/service (`models/change_proposal.py`, `services/change_proposal_service.py`), migration `a1b2c3d4e5f6`, GET/POST `api/v1/proposals.py`, and a `/trees/[id]/proposals` review UI all ship. Remaining: wire assistant tools to emit proposals. | Have | **Critical** | L | 4 | **IS NN#1.** Enforce structurally: assistant tools return proposals; only user action applies one; application flows through the normal service layer (privacy + audit). ChangeProposal itself needs full CRUD (NN#8). |
| Pluggable LLM + embedding provider | `LLMProvider`/`EmbeddingProvider` ABCs (`integrations/models/base.py`) with null, Anthropic, and OpenAI-compat (OpenAI/xAI/Ollama) implementations + registry. | Have | Critical | M | 4 | **Twelve-factor, no hard-coded keys/endpoints** (NN#7); the Ollama/self-hosted path is what makes the privacy-first promise real. |
| Per-tree AI model policy | Owner-only per-tree model selection (`Tree.ai_member_provider`/`ai_recommender_provider`, GET/PATCH `/trees/{id}/ai`, `/trees/[id]/ai` UI) (#238). | Have | Med | S | 4 | Owner-only; selects which configured provider a tree uses — keys stay in env, twelve-factor (NN#7). |
| AI research-assistant chatbot (RAG over tree) | Marquee feature; needs ModelProvider + connector + retrieval through privacy engine. | Planned | High | XL | 4 | NN#1 propose-only, NN#2 privacy retrieval, NN#3 redaction. |
| Conversational / connector record search | Search legal sources via the assistant. | Planned | High | L | 4 | Legal sources (NN#6); findings = Source + Citation (NN#5). |
| Fact extraction from documents | Extracted facts map cleanly to ChangeProposal review. | Missing | Med | M | 4 | Canonical NN#1 use case; each fact carries a Citation (NN#5). |
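
The propose-then-confirm invariant in the ChangeProposal row (assistant tools return proposals; only a user action applies one) can be sketched as below. The field names `origin`, `status`, and `operations` come from the commit message; everything else is hypothetical, including the status values, the operation shape, the tool function, and the `write` callback standing in for the normal service layer. Operations are applied one by one with no rollback, which mirrors the note that v1 is not cross-op transactional.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List

Operation = Dict[str, str]


class Status(Enum):
    PENDING = "pending"    # status names are assumptions
    APPLIED = "applied"
    REJECTED = "rejected"


@dataclass
class ChangeProposal:
    origin: str
    operations: List[Operation]
    status: Status = Status.PENDING


def assistant_suggest_birth_year(person_id: str, year: int) -> ChangeProposal:
    """Hypothetical assistant tool: it returns a proposal and writes nothing."""
    op = {"op": "set", "person_id": person_id, "field": "birth_year", "value": str(year)}
    return ChangeProposal(origin="assistant", operations=[op])


def apply_proposal(proposal: ChangeProposal, actor_type: str,
                   write: Callable[[Operation], None]) -> None:
    if actor_type != "user":
        raise PermissionError("only a user action may apply a proposal")
    if proposal.status is not Status.PENDING:
        raise ValueError("proposal is not pending")
    for op in proposal.operations:
        write(op)  # goes through the normal service layer (privacy + audit)
    proposal.status = Status.APPLIED


store: Dict[str, Dict[str, str]] = {"p1": {}}


def write(op: Operation) -> None:
    store[op["person_id"]][op["field"]] = op["value"]


proposal = assistant_suggest_birth_year("p1", 1871)
untouched_before_review = dict(store["p1"])   # the tool call changed nothing
apply_proposal(proposal, "user", write)
```

The assistant never receives the `write` callable, so the invariant holds by construction rather than by convention.
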
@@ -399,8 +398,8 @@ A documented **day-one commitment** ("UI strings externalized from day one") tha
Ordered by leverage. All are S-effort or a thin slice of a larger item, and most close a stated invariant gap.
1. **Fix `site_members` visibility tier** (Privacy, Critical/S) — defined and selectable in the UI but never handled in `can_view_tree`; fails closed unintuitively.
2. **Email-verification enforcement gate** (Privacy/Auth, High/S) — add the read-side `email_verified_at` check so a freshly registered, unverified user doesn't get a live authenticated session. Security-priority; the registration-mode env switch (open/approve/closed) is the larger follow-on, not part of this quick win.
1. **Fix `site_members` visibility tier** (Privacy, Critical/S) — **done:** `can_view_tree` now handles `site_members` (`privacy.py:56`), giving any authenticated account a read view while refusing anonymous.
2. **Email-verification enforcement gate** (Privacy/Auth, High/S) — **done (#53):** the read-side `email_verified_at` check now ships behind `REQUIRE_EMAIL_VERIFICATION`, so a freshly registered, unverified user doesn't get a live authenticated session. The registration-mode env switch (open/approve/closed) is the larger follow-on (§2.10, M-effort — not a quick win).
3. **Citation confidence selector in the cite form** (Sources, High/S) — confidence is modeled and API-writable but unreachable in the UI; every UI citation is currently NULL. Honors NN#8 and the evidence-quality thesis.
4. **Source edit UI + expose all 8 fields** (Sources, High/S) — update API exists but there is no edit form and create exposes ~3 fields; a create-but-not-edit entity violates NN#8.
5. **Make `LIVING_RECENCY_YEARS` env-configurable** (Privacy, High/S) — hardcoded 100 at `privacy.py:23`; twelve-factor (NN#7).
@@ -411,11 +410,9 @@ Ordered by leverage. All are S-effort or a thin slice of a larger item, and most
10. **`GET /{tree}/citations/{id}` endpoint** (Sources, Med/S) — API symmetry (NN#8).
11. **Transcription/abstract fields on Source** (Sources, Med/S) — add `transcription_text` + `abstract_text`, distinct from `citation_text`; core to evidence analysis.
12. **Sort the merged person timeline** (Research workflow, Med/S) — `shownEvents.sort()` on `date_start`; currently appended unsorted.
13. **Doc corrections (docs-vs-code)** (Meta, trivial/S) — edit CLAUDE.md / ARCHITECTURE so the pgvector "used" claim, the i18n "from day one" claim, and the ChangeProposal "landed" claim match reality. The repo convention requires docs to travel with code.
13. **Doc corrections (docs-vs-code)** (Meta, trivial/S) — edit CLAUDE.md / ARCHITECTURE so the pgvector "used" claim and the i18n "from day one" claim match reality. The repo convention requires docs to travel with code.
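
Quick win 5 above (making `LIVING_RECENCY_YEARS` env-configurable) could look roughly like the following. This is a minimal sketch, not the project's actual `privacy.py`: the helper name `living_recency_years` and the validation rules are assumptions made for illustration, and only the variable name and the default of 100 come from the backlog entry.

```python
import os
from typing import Mapping

# The backlog says 100 is the value currently hardcoded in privacy.py.
DEFAULT_LIVING_RECENCY_YEARS = 100


def living_recency_years(environ: Mapping[str, str] = os.environ) -> int:
    """Hypothetical helper: read LIVING_RECENCY_YEARS, falling back to 100."""
    raw = environ.get("LIVING_RECENCY_YEARS")
    if raw is None or raw.strip() == "":
        # Unset or blank: keep today's behaviour.
        return DEFAULT_LIVING_RECENCY_YEARS
    value = int(raw)  # raises ValueError on non-numeric input
    if value <= 0:
        # Assumed rule: a non-positive window would disable living-person
        # protection, so refuse it rather than silently accept it.
        raise ValueError("LIVING_RECENCY_YEARS must be a positive integer")
    return value
```

Taking the environment as a parameter keeps the helper testable without mutating `os.environ`; failing loudly on a bad value is one possible design choice for a privacy-relevant setting.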
> **Ships-with, not standalone:** *Revocable / adjustable access (membership PATCH/DELETE + role change)* is security-critical and S-effort, but it is the minimal slice of the membership work (§2.9) and ships **with** those endpoints; it is not independently shippable.
>
> **Higher priority than any quick win, but M-effort (not quick):** the **media privacy leak** (§2.4), the **child-resource redaction gap** (§2.10), and pulling the **one-command operator backup** (§2.14) forward. Treat these as **security-/data-loss-priority Phase 1–2 fixes** regardless of the quick-win list.
> **Mostly shipped this cycle (#46):** the **media privacy leak** (§2.4) and the broad **child-resource redaction gap** (§2.10) are now closed for the person/event/media/name/relationship endpoints. The narrowed remainder — applying `person_visibility` to the **citation/source list endpoints** — is an S-effort follow-up; treat it as a security-priority Phase 1–2 fix regardless of the quick-win list.
---
@@ -425,10 +422,10 @@ Where to invest to make Provenance distinct rather than a webtrees clone. Each l
**1. Property chain-of-title (the "land" half).** No surveyed competitor models ownership as a typed, cited event chain tying parties across time, with gap-flagging and bidirectional owner↔person / parcel↔place traversal, fed by **legal** public sources (BLM/GLO patents, USGS, public county deeds). This is the single clearest "no one else does this" capability. Sequence: Property + OwnershipEvent + Citation-target (Phase 3) → chain-of-title view → BLM/GLO connector (Phase 8). The Citation extension is a quick win; the entity is the prerequisite for everything else in the category.
**2. The ChangeProposal AI model.** "The assistant never writes autonomously" is a *trust* differentiator in a market where users fear AI corrupting their research. Build it structurally — assistant tools return proposals; only an explicit human action applies one; application flows through the normal service layer so it always hits the privacy engine and audit log. The same approval queue moderates untrusted human-contributor edits (Collaboration §2.9), so design them together. The audit substrate is already in place; ChangeProposal + ModelProvider are the critical path — and the docs should stop asserting ChangeProposal has landed until it has.
**2. The ChangeProposal AI model.** "The assistant never writes autonomously" is a *trust* differentiator in a market where users fear AI corrupting their research. The structural spine has **landed** — the `ChangeProposal` model/API/review UI and the pluggable `LLMProvider`/`EmbeddingProvider` abstraction both ship — so the remaining work is wiring the assistant's tools to emit proposals (never mutating directly). Assistant tools return proposals; only an explicit human action applies one; application flows through the normal service layer so it always hits the privacy engine and audit log. The same approval queue moderates untrusted human-contributor edits (Collaboration §2.9), so design them together.
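
The propose-only flow described above can be sketched as follows. This is an illustration under stated assumptions, not the shipped `ChangeProposal` model: the field set, the `ProposalStatus` values, and the functions `assistant_suggest_birth_year` and `apply_proposal` are all hypothetical, and the `apply_operation` callback stands in for the normal service layer (where the privacy engine and audit log would run).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class ProposalStatus(Enum):
    # Hypothetical status values for illustration.
    PENDING = "pending"
    APPLIED = "applied"
    REJECTED = "rejected"


@dataclass
class ChangeProposal:
    origin: str                # e.g. "assistant" or "contributor"
    operations: list[dict]     # declarative edits, not yet applied
    status: ProposalStatus = ProposalStatus.PENDING


def assistant_suggest_birth_year(person_id: str, year: int) -> ChangeProposal:
    """Assistant tool: returns a proposal and never mutates the tree."""
    op = {"person_id": person_id, "field": "birth_year", "value": year}
    return ChangeProposal(origin="assistant", operations=[op])


def apply_proposal(
    proposal: ChangeProposal,
    reviewer_id: str,
    apply_operation: Callable[[dict, str], None],
) -> None:
    """Apply a pending proposal, but only on an explicit human action."""
    if not reviewer_id:
        raise PermissionError("a human reviewer must apply the proposal")
    if proposal.status is not ProposalStatus.PENDING:
        raise ValueError("only pending proposals can be applied")
    for op in proposal.operations:
        # Each edit goes through the service-layer callback, never around it.
        apply_operation(op, reviewer_id)
    proposal.status = ProposalStatus.APPLIED
```

Because the tool only ever returns data, the same review queue can accept proposals whose `origin` is a human contributor without any change to the apply path.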
**3. Anonymous, mutual-consent cross-tree hints.** The privacy model already redacts living people for anonymous viewers, so a hint system that reveals *nothing identifying* until both sides opt in is achievable by construction — and is a categorically more trustworthy version of MyHeritage Smart Matches / Ancestry hints. Requires the matching engine (pgvector enablement + candidate generation, Phase 7), the notification/event-dispatch substrate (§2.9), and the messaging channel that opens only post-consent.
**4. True self-hosting + data ownership.** Full account export/import, soft-delete recovery, GEDCOM round-trip, env-driven everything, and (to-build) operator-grade scheduled backup + ARM support make Provenance the genealogy app you actually own. Two correctness items gate the promise: GEDCOM export must stop dropping citations (a Provenance→Provenance round-trip currently destroys the sources graph), and operator backup must move from "documented procedure" to a one-command dump. The Ollama/self-hosted ModelProvider path means even the AI assistant runs without tree data leaving the deployment — a promise no commercial competitor can make.
**4. True self-hosting + data ownership.** Full account export/import, soft-delete recovery, GEDCOM round-trip, env-driven everything, a one-command operator backup, and (to-build) scheduled off-host backup + ARM support make Provenance the genealogy app you actually own. The two correctness items that gated the promise have **landed**: GEDCOM export now preserves citations (the Provenance→Provenance round-trip keeps the sources graph), and operator backup moved from "documented procedure" to a one-command dump (`deploy/backup.sh`). What remains is scheduled/verified-restore tooling and ARM builds. The Ollama/self-hosted ModelProvider path means even the AI assistant runs without tree data leaving the deployment — a promise no commercial competitor can make.
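
To make the self-hosted provider path concrete, here is a rough sketch of an env-driven provider registry. Only the `LLMProvider` class name comes from the backlog; the `complete` method, the registry functions, and the environment variable names `LLM_PROVIDER`, `LLM_BASE_URL`, and `LLM_MODEL` are assumptions for illustration and may not match `integrations/models/base.py`. The OpenAI-compatible provider is a stub that makes no network calls.

```python
import os
from abc import ABC, abstractmethod
from typing import Callable, Mapping


class LLMProvider(ABC):
    @abstractmethod
    def complete(self, prompt: str) -> str:
        """Hypothetical method: return the model's reply to a prompt."""


class NullLLMProvider(LLMProvider):
    """Provider for deployments with no model configured: always empty."""

    def complete(self, prompt: str) -> str:
        return ""


class OpenAICompatProvider(LLMProvider):
    """Stub for any OpenAI-compatible endpoint (hosted or a local server)."""

    def __init__(self, base_url: str, model: str) -> None:
        self.base_url = base_url
        self.model = model

    def complete(self, prompt: str) -> str:
        raise NotImplementedError("the HTTP call is omitted in this sketch")


Factory = Callable[[Mapping[str, str]], LLMProvider]

# Provider name -> factory that builds the provider from the environment.
_REGISTRY: dict[str, Factory] = {
    "null": lambda env: NullLLMProvider(),
    "openai_compat": lambda env: OpenAICompatProvider(
        base_url=env["LLM_BASE_URL"],
        model=env.get("LLM_MODEL", ""),
    ),
}


def get_provider(environ: Mapping[str, str] = os.environ) -> LLMProvider:
    """Select a provider purely from configuration; nothing is hard-coded."""
    name = environ.get("LLM_PROVIDER", "null")
    if name not in _REGISTRY:
        raise KeyError(f"unknown LLM provider: {name!r}")
    return _REGISTRY[name](environ)
```

Pointing the assumed `LLM_BASE_URL` at a local server (for example an Ollama instance on the same host) would keep prompts, and therefore tree data, inside the deployment.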
**5. Sources-first as a felt experience.** The two-tier model is built; the differentiator is making it *visible and low-friction*: a guided Evidence-Explained citation builder, transcription/abstract fields, source-driven data entry (transcribe a document into the tree), per-fact confidence surfaced in the UI, and — critically — citations that **survive GEDCOM export**. These turn "every fact links to where it came from" from an architecture note into the product's personality.
**5. Sources-first as a felt experience.** The two-tier model is built, and citations now **survive GEDCOM export** (#232); the remaining differentiator is making sourcing *visible and low-friction*: a guided Evidence-Explained citation builder, transcription/abstract fields, source-driven data entry (transcribe a document into the tree), and per-fact confidence surfaced in the UI. These turn "every fact links to where it came from" from an architecture note into the product's personality.