# Provenance — Product Requirements Document

**Status:** Draft v0.1. Now describes a partially implemented system: Phases 0 and 1 have shipped, along with early slices of later phases.
**Owner:** Justin Paul
**Last updated:** 2026-06-10

---

## 1. Problem & context

People who care about where they come from have two real options today: pay an ongoing subscription to a big commercial genealogy platform (Ancestry, MyHeritage, Findmypast), or wrangle a desktop GEDCOM tool from the 2000s. The commercial platforms gate the best features behind paywalls, lock records to their ecosystem, and treat family data as a corporate asset. The legacy desktop tools are single-user, hard to share, and were not built for modern collaboration or for running as a service.

Meanwhile the people doing this work have changed. People now in their thirties through fifties — the ones who became keepers of the family photos after a parent downsized or passed — expect software that feels like the modern tools they already use, not Family Tree Maker 2007. They want something that respects their data, runs where they choose, and treats sources, photos, and stories as first-class.

There is also an adjacent problem the genealogy tools ignore entirely: **chain of ownership for property.** In much of the country, farm and rural families trace their land back to original federal patents, through a century or more of deeds, inheritances, and sales. Genealogy tools have no concept of property; property-research tools have no concept of family. Provenance treats both as facets of the same question — the documented story of where something came from — which is exactly what the word *provenance* has always meant.

## 2. Goals & non-goals

### Goals

- A modern, mobile-friendly web application a casual user can be productive in within ten minutes, without dismissing the needs of serious researchers
- Treat **sources** as a first-class concept; every fact can link to where it came from
- Support **property ownership history** (chain of title) alongside people, places, and events
- **Multi-tenant**: any number of users coexist on one deployment, each with their own trees
- An **AI research assistant** that searches legally permissible sources and proposes data for user approval; it never writes autonomously
- A **cross-tree hint system** that anonymously surfaces possible matches between users with overlapping family, with consent required before any identity is revealed
- **Privacy controls** at tree and person level, with living-person protection on by default
- **Direct messaging** between users for collaboration
- **GEDCOM 7** import and export
- **Self-hostable** and container-native; deployable behind Caddy and (preferred) a Cloudflare Tunnel
- **CI/CD** via Gitea Actions, building images to the Gitea container registry for servers to pull

### Non-goals (for now)

- **DNA matching, ethnicity estimation, or raw DNA ingestion.** Schema stays future-friendly, but this is out of scope until there is demand from real users.
- **Native mobile apps.** Responsive web only for now; native iOS/Android may follow.
- **Per-fact privacy controls.** Privacy is tree- and person-level only; per-fact adds UI complexity for marginal value at this stage.
- **International property record formats** (Torrens title, continental cadastral systems). The data model is country-agnostic, but v1 forms, scrapers, and validators target the US.
- **Scraping of paywalled or terms-prohibited sources** (Ancestry, MyHeritage, 23andMe). The framework supports new scrapers; the project ships only legally permissible ones.
- **A first-party hosted SaaS.** Possible later; would require its own business model. Code is BUSL-1.1 (non-commercial) precisely to keep that door open.

## 3. Target users & personas

### Primary — the Family Keeper (casual)

Mid-thirties to late-fifties. Became the de facto family historian. Has a shoebox of photos and stories and maybe a half-built Ancestry tree they resent paying for. Cares about getting names and dates right but mostly wants the *stories* preserved. Limited patience for jargon. Comfortable with web apps; not a developer.

*Goals:* build a coherent two-or-three-generation tree on each side, attach photos and key documents, share with siblings and cousins, and not lose it all when they stop paying a service.

### Secondary — the Serious Researcher

Hobbyist or retiree with twenty-plus years of accumulated research, thousands of people, and strong opinions about citation quality. Frustrated by tools that hide the citation layer or impose rigid source templates.

*Goals:* capture full, well-graded citations; move existing GEDCOM data in cleanly; use AI to chase leads while keeping tight control over what enters the tree.

### Tertiary — the Property Researcher (often overlaps the above)

Farmers, rural heirs, and local-history enthusiasts who care about land as much as lineage. Wants to trace a parcel from original patent through every owner to the present and tie it to the family.

*Goals:* reconstruct chain-of-title from deeds and patents; link land history to family; use one tool for both.

### Operator — the Self-Hoster (technical)

Runs services in a home lab behind a reverse proxy and tunnel. Wants a clean container drop-in that uses standard infrastructure and respects their time.

*Goals:* deploy with one compose file and an env file; plug in their own auth, mail, storage, and model providers; upgrade safely by pulling new images.

### Onboarding adaptation

At account creation, users pick a self-description (Family Keeper / Serious Researcher / Property Researcher). This tailors default views, density, and which advanced features are surfaced. It is a presentation hint, not a permission boundary, and can be changed anytime.

## 4. Use cases & user stories

Acceptance criteria (AC) are written to be testable.

### Accounts & onboarding
- **US-001** As a visitor, I create an account via any configured method (local password, Google, Apple, Facebook, or OIDC such as Authentik). *AC:* account created; email verified where applicable; session established.
- **US-002** As a new user, I choose a persona so the UI adapts. *AC:* persona stored; default workspace reflects it; changeable in settings.

### Trees, people, relationships
- **US-010** I create a tree with a name and default privacy and become its owner. *AC:* tree created; I am owner; default privacy applied to new records.
- **US-011** I add a person with at least a name, plus optional vitals, gender, and notes. *AC:* person persisted and visible in the tree.
- **US-012** I define relationships (parent/child, spouse/partner, sibling), including adoptive, step, foster, and donor relationships. *AC:* relationships render correctly and survive GEDCOM round-trip.
- **US-013** I attach a source to any fact. *AC:* fact displays its citation; source detail reachable in one click.

### Sources & media
- **US-020** I create a source (title, repository, citation text, optional URL) and reuse it across many facts. *AC:* one source can back many facts; editing it updates all references.
- **US-021** I upload media (image, scan, PDF, audio) and attach it to people, sources, events, or properties. *AC:* media stored in object storage; thumbnail/preview generated; attachment links resolve.

### Property & chain of title
- **US-030** I create a property with a legal description (metes-and-bounds prose or PLSS township/range/section), optional coordinates, and a lifespan. *AC:* property persisted; legal description preserved verbatim.
- **US-031** I record an ownership event (grant/patent, purchase, sale, inheritance, gift, tax sale, foreclosure, eminent domain), with date, parties, optional consideration, and a source. *AC:* event links property to person(s) and to its source.
- **US-032** I view a property's chain of title from first grant to present. *AC:* ownership events render in chronological order with gaps flagged.
- **US-033** I view every property a person held, and every parcel ever recorded at a place. *AC:* both reverse lookups return correct sets.
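
The chain-of-title view in US-032 could be sketched as below. This is a minimal illustration, not the shipped model: the `OwnershipEvent` fields, the use of plain strings as party identifiers, and the rule that a "gap" is a transfer whose grantors include none of the previous grantees are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class OwnershipEvent:
    # Hypothetical shape; the real entity also carries consideration and a source.
    when: date
    kind: str             # e.g. "patent", "sale", "inheritance"
    grantors: frozenset   # parties giving up title
    grantees: frozenset   # parties receiving title


def chain_of_title(events):
    """Return (event, gap) pairs in chronological order.

    gap is True when none of the event's grantors received title in the
    previous event, which suggests a missing deed in between.
    """
    chain = []
    holders = None  # grantees of the previous event; None before the first grant
    for event in sorted(events, key=lambda e: e.when):
        gap = holders is not None and not (event.grantors & holders)
        chain.append((event, gap))
        holders = event.grantees
    return chain


patent = OwnershipEvent(date(1872, 5, 1), "patent",
                        frozenset({"United States"}), frozenset({"A. Mueller"}))
sale = OwnershipEvent(date(1901, 3, 9), "sale",
                      frozenset({"A. Mueller"}), frozenset({"B. Olson"}))
later = OwnershipEvent(date(1950, 7, 2), "sale",
                       frozenset({"C. Reed"}), frozenset({"D. Hart"}))

for event, gap in chain_of_title([later, patent, sale]):
    print(event.when, event.kind, "GAP" if gap else "ok")
```

In this made-up data the 1950 sale is flagged, because its grantor never appears as a grantee in the 1901 sale.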

### Privacy & sharing
- **US-040** I set a tree to one of four visibility levels — private, unlisted, site_members, or public. *AC:* visibility enforced for anonymous and non-owner users; at the **site_members** level the tree is visible to any authenticated instance user (signed in but not a member of the tree) and hidden from anonymous visitors.
- **US-041** I mark any individual private even within a public tree. *AC:* that person's details hidden from non-owners regardless of tree setting.
- **US-042** Living people are hidden from non-owners by default. *AC:* a person with no death fact and a plausibly-living birth date shows only minimal/no PII to non-owners; owner can override per person.
- **US-043** I add a co-owner to a tree. *AC:* co-owner can edit per role; action attributed to them in the audit log.
- **US-044** I share a public tree via link. *AC:* link renders a read-only, living-person-safe view without requiring an account.

### Hints, matching & messaging
- **US-050** When another user's tree overlaps mine on a probable same individual, I get an anonymized possible-match notification. *AC:* notification reveals no identifying detail about the other user or their living people.
- **US-051** A match becomes a connection only when both users opt in. *AC:* identities/data shared only after mutual consent.
- **US-052** I message another connected user. *AC:* threaded direct messages; notification on new message.

### AI research assistant
- **US-060** I ask the assistant about a person or property; it searches the configured, legally permissible sources and presents findings with provenance. *AC:* results cite their source; nothing is written to the tree yet.
- **US-061** The assistant proposes changes as a reviewable diff (add/edit/link). *AC:* I can approve, edit, or reject each proposed change; only approved changes are written.
- **US-062** Every proposal (approved or rejected) is recorded. *AC:* audit log shows actor = assistant, the proposal, and my decision.
- **US-063** I configure the model provider (Anthropic, OpenAI, xAI, or a self-hosted/local model) and supply keys. *AC:* assistant uses the selected provider; on my deployment, operator-supplied keys are used.
- **US-064** An operator enables/disables scraper sources and can add new ones. *AC:* scrapers are pluggable; disabled sources are never queried.

### Interop
- **US-070** I import a GEDCOM 7 file. *AC:* people, relationships, events, and sources import without data loss; an import report flags anything unmapped.
- **US-071** I export a tree as GEDCOM 7 anytime. *AC:* export re-imports into Provenance (and at least one third-party tool) without loss of core data.

### Soft delete & recovery
- **US-080** I soft-delete a person, property, or tree; it is hidden but recoverable for 30 days. *AC:* item leaves normal views; appears in a recovery area with deletion timestamp.
- **US-081** After 30 days, soft-deleted items are purged. *AC:* a scheduled job permanently removes items past the window; purge is logged.
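
The selection step of the scheduled purge job in US-081 might look like the sketch below. The function name, the id-to-timestamp mapping, and the choice to purge at exactly 30 days (rather than strictly after) are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # recovery window from US-080


def due_for_purge(deleted_at_by_id, now, retention=RETENTION):
    """Return the ids of soft-deleted items whose recovery window has elapsed.

    deleted_at_by_id maps an item id to its deletion timestamp, or to None
    when the item has not been deleted.
    """
    return sorted(
        item_id
        for item_id, deleted_at in deleted_at_by_id.items()
        if deleted_at is not None and now - deleted_at >= retention
    )


now = datetime(2026, 6, 10, tzinfo=timezone.utc)
items = {
    "person-1": now - timedelta(days=45),  # past the window
    "person-2": now - timedelta(days=3),   # still recoverable
    "tree-9": None,                        # never deleted
}
print(due_for_purge(items, now))  # ['person-1']
```

A real job would then delete the returned rows and write one audit entry per purge, as the acceptance criterion requires.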

### Search
- **US-090** I search across my trees by name (with spelling/variant tolerance), date, and place. *AC:* fuzzy name matches surface (e.g., Mueller/Müller/Muller); results respect privacy.

### Operator
- **US-100** I deploy via one compose file and an env file. *AC:* a documented compose stack brings up app, database, and object storage.
- **US-101** I configure auth providers, SMTP, object storage, database, and model providers via config. *AC:* all are environment-driven; no code changes needed.

## 5. Functional requirements

### 5.1 Identity & access
- Pluggable authentication: local password (with email verification and reset), social sign-in (Google, Apple, Facebook), and generic **OIDC** (validated against Authentik; should also work with Keycloak, Auth0, etc.). Operators enable any subset.
- Roles per tree: **owner**, **co-owner/editor**, **viewer**. Public/unlisted trees also have an implicit anonymous viewer.
- **Instance owner/operator:** an env-declared operator role (via `OWNER_EMAIL`, requiring a verified email), distinct from the per-tree roles. It is an operations/config role only and is **not** a privacy bypass — it grants no access to others' tree data or PII.
- The AI assistant acts as a distinct, scoped principal bound to the user it is helping — it can never exceed that user's rights, and its actions are separately attributable.

### 5.2 Data model (core entities)
- **Person** — names (with parts and variants), vitals, gender, notes, living/deceased status.
- **Relationship** — typed edges between people, including non-traditional and complex families.
- **Event** — typed, dated, placed facts (birth, marriage, residence, etc.) for people and families.
- **Place** — a gazetteer-style entity with hierarchy and **historical name variants** and date ranges (e.g., Königsberg → Kaliningrad), so records sort correctly across renamings and border changes.
- **Source** & **Citation** — a source can be cited by many facts; citations carry locality (page, entry) and optional quality grading.
- **Media** — binary assets in object storage with metadata; EXIF location/PII strippable on upload.
- **Property** — identity (with informal and legal names), lifespan, legal description, optional geometry; future-friendly for parcels that split or merge over time.
- **Ownership event** — typed event linking Property and Person(s) across time, with consideration and source.
- All entities support soft delete and appear in the audit log.
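
As an illustration of the **Place** entity's dated name variants, the sketch below resolves the name a place carried in a given year. The class and field names are hypothetical; only the Königsberg/Kaliningrad example comes from the text above, with 1946 used as the year of renaming.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlaceName:
    name: str
    start_year: Optional[int]  # None means "no known start"
    end_year: Optional[int]    # None means "still in use"


@dataclass(frozen=True)
class Place:
    place_id: str      # stable identity that survives renamings
    names: tuple       # PlaceName entries, oldest first

    def name_in(self, year):
        """Return the variant in use during the given year."""
        for variant in self.names:
            started = variant.start_year is None or year >= variant.start_year
            not_ended = variant.end_year is None or year <= variant.end_year
            if started and not_ended:
                return variant.name
        return self.names[-1].name  # fall back to the most recent variant


city = Place(
    place_id="place-42",
    names=(
        PlaceName("Königsberg", None, 1945),
        PlaceName("Kaliningrad", 1946, None),
    ),
)
print(city.name_in(1900))  # Königsberg
print(city.name_in(1980))  # Kaliningrad
```

Because events reference `place_id` rather than a name string, a birth recorded in Königsberg and a death recorded in Kaliningrad still resolve to the same place.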

### 5.3 Sources-first behavior
- Any fact may carry one or more citations. The UI makes adding a source low-friction and visibly rewards it (a "sourced" indicator).
- Sources are reusable objects, not free-text strings, so a single deed or census page backs many facts.
- The assistant must attach provenance to everything it proposes.

### 5.4 GEDCOM 7
- Import and export. Import produces a mapping report; unmapped/extension tags are preserved where possible rather than dropped. Round-trip fidelity is a tracked success metric (§7).

### 5.5 Privacy engine
- Effective visibility = function(tree visibility, person override, living status, viewer role).
- Tree visibility has four levels: **private** (members only; default), **unlisted** (anyone with the link, not listed/indexed), **site_members** (any authenticated instance user), and **public** (anonymous + listed/indexable).
- Living-person rule: a person with no death fact is treated as possibly living if their birth falls within a configurable recency window (default ~100 years) or their birth date is unknown; non-owners see minimal or no PII for such people.
- Public/link views must render through the same privacy engine — no bypass path.
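
The visibility function described in this section could be sketched as follows. It is one possible reading of the rules, not the shipped engine: the enum names, the `via_link` flag, treating co-owners as owners, and the three-way `Access` result are assumptions, and the owner's per-person living override is left out for brevity.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class TreeVisibility(Enum):
    PRIVATE = "private"
    UNLISTED = "unlisted"
    SITE_MEMBERS = "site_members"
    PUBLIC = "public"


class Viewer(Enum):
    ANONYMOUS = "anonymous"      # not signed in
    SITE_MEMBER = "site_member"  # signed in, not a member of this tree
    TREE_VIEWER = "viewer"       # tree member with the viewer role
    OWNER = "owner"              # owner or co-owner (assumption)


class Access(Enum):
    NONE = "none"
    MINIMAL = "minimal"  # record shown with PII withheld
    FULL = "full"


@dataclass(frozen=True)
class PersonFacts:
    birth_year: Optional[int]
    has_death_fact: bool
    marked_private: bool = False


def possibly_living(person, today, window_years=100):
    if person.has_death_fact:
        return False
    if person.birth_year is None:
        return True  # unknown birth is treated as possibly living
    return today.year - person.birth_year < window_years


def effective_access(tree, person, viewer, today, via_link=False):
    if viewer is Viewer.OWNER:
        return Access.FULL
    # Tree-level gate for everyone who is not an owner.
    if viewer is Viewer.TREE_VIEWER or tree is TreeVisibility.PUBLIC:
        tree_ok = True
    elif tree is TreeVisibility.UNLISTED:
        tree_ok = via_link
    elif tree is TreeVisibility.SITE_MEMBERS:
        tree_ok = viewer is Viewer.SITE_MEMBER
    else:
        tree_ok = False
    if not tree_ok or person.marked_private:
        return Access.NONE
    if possibly_living(person, today):
        return Access.MINIMAL
    return Access.FULL


today = date(2026, 6, 10)
alive = PersonFacts(birth_year=1990, has_death_fact=False)
print(effective_access(TreeVisibility.PUBLIC, alive, Viewer.ANONYMOUS, today))
```

Keeping the whole decision in one pure function makes the "no bypass path" requirement easier to enforce: public and link views call the same function as authenticated views.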

### 5.6 Collaboration
- Co-ownership with roles; optimistic editing with conflict surfacing; full audit log (actor, timestamp, before/after).
- Direct messaging between connected users; messaging is also the channel for confirmed hint matches.

### 5.7 Hint & matching engine
- Runs asynchronously. Candidate generation uses fuzzy name + vitals + place; ranking may use embeddings (see §6) to reduce false positives.
- **Privacy-preserving by construction:** a match notification reveals no identifying data about the other party or their living individuals. Reveal occurs only on mutual opt-in.
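
The reveal-on-mutual-consent rule can be illustrated with a small state holder. The `PossibleMatch` class, its fields, and the notification payload keys are hypothetical; the point is only that the other party's identity is absent from the payload until both sides have opted in.

```python
from dataclasses import dataclass, field


@dataclass
class PossibleMatch:
    match_id: str
    user_a: str
    user_b: str
    consents: set = field(default_factory=set)

    def _require_party(self, user_id):
        if user_id not in (self.user_a, self.user_b):
            raise PermissionError("not a party to this match")

    def opt_in(self, user_id):
        self._require_party(user_id)
        self.consents.add(user_id)

    @property
    def connected(self):
        # A connection exists only once both parties have consented.
        return self.consents == {self.user_a, self.user_b}

    def notification_for(self, user_id):
        self._require_party(user_id)
        payload = {"match_id": self.match_id, "kind": "possible_match"}
        if self.connected:
            other = self.user_b if user_id == self.user_a else self.user_a
            payload["other_user"] = other
        return payload


match = PossibleMatch("m-1", user_a="u-alice", user_b="u-bob")
print(match.notification_for("u-alice"))  # no identity yet
match.opt_in("u-alice")
match.opt_in("u-bob")
print(match.notification_for("u-alice"))  # now includes the other user
```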

### 5.8 AI research assistant
- Provider-agnostic abstraction over hosted models (Anthropic, OpenAI, xAI) and self-hosted/local models (e.g., an OpenAI-compatible endpoint or Ollama).
- Operators register one or more model providers (env / registry); a tree owner then selects the active provider(s) for that tree via an owner-only AI settings surface.
- Tool-mediated access to the same CRUD operations a user has, scoped to that user, via a server with explicitly scoped capabilities (an MCP-style tool boundary).
- **Propose-then-confirm is mandatory.** The assistant drafts changes as diffs; nothing persists without explicit user approval.
- Source connectors are a **plugin framework**; the project ships connectors only for legally permissible sources (e.g., FamilySearch API, Find A Grave, WikiTree, BLM/GLO land patents, USGS maps, public-domain newspapers, public county records). Operator-supplied scrapers can be added later.
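
A minimal sketch of the propose-then-confirm flow follows. The `ChangeProposal` name appears in §8, but every field shown here, the in-memory `TreeStore`, and the audit entry layout are assumptions made for illustration and do not describe the shipped schema.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class ChangeProposal:
    proposal_id: str
    op: str            # "add", "edit", or "link"
    target: str        # id of the record the change applies to
    after: dict        # proposed field values
    source_url: str    # provenance the assistant must attach
    decision: Decision = Decision.PENDING


class TreeStore:
    """Hypothetical in-memory stand-in for the tree and its audit log."""

    def __init__(self):
        self.records = {}
        self.audit_log = []

    def decide(self, proposal, user_id, approve):
        if proposal.decision is not Decision.PENDING:
            raise ValueError("proposal already decided")
        proposal.decision = Decision.APPROVED if approve else Decision.REJECTED
        if approve:
            # The only write path: an explicit user approval.
            self.records[proposal.target] = dict(proposal.after)
        # Approved and rejected proposals are both recorded (US-062).
        self.audit_log.append({
            "actor": "assistant",
            "proposal_id": proposal.proposal_id,
            "decision": proposal.decision.value,
            "decided_by": user_id,
        })


store = TreeStore()
draft = ChangeProposal("p-1", "add", "person-7",
                       {"name": "Anna Mueller"}, "https://example.test/record/1")
store.decide(draft, user_id="u-1", approve=False)
print(store.records)    # {} because the proposal was rejected
print(store.audit_log)  # one entry with decision "rejected"
```

The assistant only ever constructs `ChangeProposal` objects; it has no method that touches `records` directly.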

### 5.9 Search
- Cross-tree search with fuzzy name matching (e.g., Postgres trigram), date and place filters, all filtered through the privacy engine.
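
To show why trigram matching tolerates spelling variants, here is a pure-Python sketch in the spirit of Postgres `pg_trgm`. It is an illustration only: it handles a single word, folds diacritics with `unicodedata`, and uses shared trigrams divided by total distinct trigrams as the score. In the real system the database would do this work.

```python
import unicodedata


def fold(name):
    """Lowercase and strip combining marks, so 'Müller' becomes 'muller'."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def trigrams(word):
    # Pad with two leading spaces and one trailing space, as pg_trgm does.
    padded = f"  {word} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams, from 0.0 to 1.0."""
    ta, tb = trigrams(fold(a)), trigrams(fold(b))
    return len(ta & tb) / len(ta | tb)


print(similarity("Müller", "Muller"))    # 1.0 after diacritic folding
print(similarity("Mueller", "Müller"))   # 0.5
print(similarity("Mueller", "Schmidt"))  # 0.0
```

With a cutoff around 0.3 (the `pg_trgm` default similarity threshold), all three Mueller spellings from US-090 would match one another while unrelated surnames would not.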

### 5.10 Notifications
- In-app and email (transactional) for new messages, hint matches, co-owner invites, and import completion. Email uses operator-configured SMTP.

### 5.11 Administration & operations
- All integration points (auth, SMTP, object storage, database, model providers, scrapers) are environment/config-driven.
- Health endpoints; structured logs; a documented backup/restore procedure; safe upgrade via image pull + migration.
- Owner-only operator surface: instance status and configuration (`GET /api/v1/admin/instance` and the `/admin` UI), scoped to the instance owner and exposing no tree contents or PII.
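
Environment-driven configuration might be loaded along these lines. Only `OWNER_EMAIL` is named in this document; `DATABASE_URL`, `S3_ENDPOINT_URL`, and `SMTP_HOST` are placeholder variable names invented for the sketch, as are the `Settings` class and `load_settings` function.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Hypothetical variable names; the real deployment may use different ones.
REQUIRED = ("DATABASE_URL", "S3_ENDPOINT_URL", "SMTP_HOST")


@dataclass(frozen=True)
class Settings:
    database_url: str
    s3_endpoint_url: str
    smtp_host: str
    owner_email: Optional[str]  # instance operator, if one is declared


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from the environment and fail fast on missing values."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing required settings: " + ", ".join(missing))
    return Settings(
        database_url=env["DATABASE_URL"],
        s3_endpoint_url=env["S3_ENDPOINT_URL"],
        smtp_host=env["SMTP_HOST"],
        owner_email=env.get("OWNER_EMAIL") or None,
    )


settings = load_settings({
    "DATABASE_URL": "postgresql://db/provenance",
    "S3_ENDPOINT_URL": "http://minio:9000",
    "SMTP_HOST": "mail.example.test",
    "OWNER_EMAIL": "owner@example.test",
})
print(settings.owner_email)
```

Failing at startup when a required value is absent keeps misconfiguration out of request handling, which suits the one-compose-file deployment goal.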

## 6. Non-functional requirements

- **Privacy & security:** TLS in transit; encryption at rest for sensitive fields and media where supported; least-privilege AI principal; immutable audit log; no PII in URLs; privacy engine has no bypass.
- **Performance & scale:** trees of 50k+ people remain usable; P95 latency for typical page interactions is under 2 s; long operations (import, scraping, matching) run asynchronously with progress reporting; assistant responses stream.
- **Reliability:** graceful degradation when a model provider or scraper is unavailable; failures never corrupt the tree; background jobs are retryable.
- **Accessibility:** target WCAG 2.2 AA.
- **Internationalization:** UI strings externalized from day one (translations later); Unicode-correct handling of non-Latin names and historical/non-Gregorian dates.
- **Mobile:** responsive, touch-first; functional parity with desktop on small screens.
- **Operability:** single compose stack; twelve-factor config; reproducible container images from CI.

## 7. Success metrics

- **Time to first useful action** for a new casual user: under 10 minutes at most, with under 5 minutes as the target.
- **GEDCOM round-trip fidelity:** ≥ 99% of core records preserved on import → export → import.
- **AI proposal acceptance rate:** tracked as a proxy for assistant usefulness and a guard against low-value noise.
- **Sourced-fact ratio:** share of facts carrying at least one citation (the product's core value, made measurable).
- **Successful hint connections:** matches that lead to a mutual-consent connection.
- **Casual-user retention:** week-1 and week-4 return rates.
- **Self-host footprint:** count of deployments in the wild (opt-in, anonymous signal only; see §11).
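
Two of the metrics above reduce to simple ratios, sketched here under stated assumptions: records are compared as plain dictionaries keyed by an identifier that is stable across the round trip, and a fact counts as sourced when it has at least one citation. The function names are hypothetical.

```python
def round_trip_fidelity(first_import, second_import):
    """Share of records from the first import that come back identical
    after import, export, and re-import."""
    if not first_import:
        return 1.0
    preserved = sum(
        1
        for key, record in first_import.items()
        if second_import.get(key) == record
    )
    return preserved / len(first_import)


def sourced_fact_ratio(citation_counts):
    """Share of facts with at least one citation, given a per-fact count."""
    if not citation_counts:
        return 0.0
    return sum(1 for n in citation_counts if n > 0) / len(citation_counts)


before = {
    "I1": {"name": "Anna"},
    "I2": {"name": "Karl"},
    "I3": {"name": "Greta"},
    "I4": {"name": "Otto", "note": "farmer"},
}
after = {
    "I1": {"name": "Anna"},
    "I2": {"name": "Karl"},
    "I3": {"name": "Greta"},
    "I4": {"name": "Otto"},  # the note was lost on the round trip
}
print(round_trip_fidelity(before, after))  # 0.75
print(sourced_fact_ratio([0, 1, 2, 0]))    # 0.5
```

Against the 99% target, a tree of 1,000 core records could lose or alter at most 10 of them on the round trip.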

## 8. Sequencing

Provenance ships continuously and is stood up in a live lab as it goes; there is no hard MVP/v2 line, but features land in dependency order so each tranche is usable.

- **Phase 0 — Foundation:** *(shipped)* backend + DB schema; local auth + email verify; frontend scaffold; container images; CI/CD (Gitea Actions → Gitea registry → server pull); one-command compose deploy.
- **Phase 1 — Core tree:** *(shipped)* people, relationships, events; sources & citations; media uploads; soft delete + recovery; tree-level privacy (now four levels: private/unlisted/site_members/public).
- **Phase 2 — Standards & polish:** *(partly shipped — GEDCOM 7 import/export #232; fuzzy/trigram search)* GEDCOM 7 import/export; search with fuzzy names; living-person protection; person-level privacy override; onboarding + persona selector.
- **Phase 3 — Property:** property entity; ownership events; chain-of-title view; property-aware sources.
- **Phase 4 — AI assistant:** *(partly shipped early — provider abstraction + multi-provider registry #235/#237; ChangeProposal propose-then-confirm #236)* provider abstraction (hosted + local); scraper plugin framework; first connectors (FamilySearch, Find A Grave); propose-diff approval flow; assistant actions in audit log.
- **Phase 5 — Federated auth:** *(not shipped — only the `AuthProvider` ABC exists)* OIDC (Authentik), then Google/Apple/Facebook sign-in.
- **Phase 6 — Collaboration:** *(tree membership #233 landed early)* tree co-owners; audit-log UI; direct messaging; notifications.
- **Phase 7 — Cross-tree hints:** async matching engine (embeddings-assisted); anonymous match notifications; mutual-consent reveal.
- **Phase 8 — Land sources:** BLM/GLO patents; USGS map integration; additional county-deed connectors (merge existing scrapers).
- **Phase 9 — Hardening & dogfooding** toward a possible hosted offering.

**Shipped ahead of sequence (operations & platform):** instance-owner/operator role (#240); operator backup tooling (#234); a schema-drift guard (#239). These landed early because the live lab deployment needed them. Their later issue numbers do not imply that the earlier phases are complete: **Phase 5 federated auth/OIDC is not yet shipped**, and only the `AuthProvider` ABC is in place.

Rationale: enabling work (schema, auth, deploy, sources) precedes everything; GEDCOM lands before the assistant so AI writes target a stable model; property follows a well-tested people graph; hints come late because they require multiple populated trees.

## 9. Technical direction (summary)

Not a substitute for an architecture document, but the agreed shape:

- **Frontend:** Next.js + React + TypeScript + Tailwind + shadcn/ui; server components for SEO-friendly public trees; designed mobile-first.
- **Backend:** Python + FastAPI; async; auto-generated OpenAPI (a natural base for the assistant's scoped tool surface).
- **Database:** PostgreSQL (relational core + JSONB for flexible attributes; trigram search; vector embeddings for match ranking).
- **Object storage:** S3-compatible, configurable (MinIO for self-host; any S3 provider otherwise).
- **Packaging:** container-native; everything configurable via environment so third parties could run their own deployment.
- **Edge:** Caddy reverse proxy; Cloudflare Tunnel preferred but never required.
- **Email:** operator-configured SMTP.

## 10. Assumptions, dependencies & risks

**Assumptions:** hosted model APIs remain available to BYO-key users; FamilySearch and similar APIs remain accessible for personal use; BLM/GLO and USGS data remain public; operators can run Postgres + S3-compatible storage.

**Dependencies:** PostgreSQL; an S3-compatible store; SMTP; optional Cloudflare Tunnel; optional external model providers.

**Risks:**
- *Scraper fragility* — sources change; need monitoring and graceful failure, not hard crashes.
- *AI proposal quality* — even with human review, low-value proposals waste reviewer time; need confidence signals and easy rejection.
- *Hint privacy leakage* — naive matching could expose PII; the reveal-on-consent design and a written threat model are mandatory before launch.
- *GEDCOM messiness* — real-world files are inconsistent even under v7; expect ongoing compatibility work.
- *Genealogical edge cases* — adoption, donor conception, multiple marriages and remarriages, name changes, historically polygamous families; the schema must handle these without awkward workarounds.
- *License comprehension* — "source-available" is easily confused with open source; messaging must be clear.

## 11. Open questions

- **Telemetry:** ship opt-in, fully anonymous usage signal to understand real-world use, or none at all?
- **Embeddings for matching:** hosted provider vs. local model — and is an embedding even needed in early phases, or does fuzzy + rules suffice?
- **DNA:** revisit as a future phase, or hold as a permanent non-goal?
- **Native mobile:** if/when, and PWA vs. true native?
- **Hosted SaaS:** business model, and what (if anything) differs from the self-hostable build?