
# Provenance — Architecture

**Status:** Draft v0.1
**Companion to:** [docs/PRD.md](PRD.md)
**Last updated:** 2026-06-06

This document describes *how* Provenance is built. The PRD describes *what* and *why*. Where the two disagree, the PRD wins on intent and this document is updated to match.

---

## 1. Guiding constraints

These come straight from the PRD and shape every decision below:

1. **Self-hostable and container-native.** Everything runs in containers, configured by environment. A single `compose` stack stands up the whole system.
2. **Multi-tenant.** Many users and many trees coexist on one deployment with hard data isolation between tenants.
3. **Privacy has no bypass.** Every read path — UI, API, public link, AI assistant, search, hints — resolves visibility through one engine. There is no "admin shortcut" that returns raw rows.
4. **The AI assistant never writes autonomously.** It is a scoped principal that can only *propose* changes; persistence requires explicit user approval.
5. **Sources are first-class.** The schema makes a fact-without-a-source possible but a fact-with-a-source cheap and visible.
6. **Configurable integrations.** Auth, mail, object storage, model providers, and scrapers are all pluggable so a third party could run their own deployment.

## 2. System topology

```
                         ┌─────────────────────────┐
        Internet  ─────► │  Cloudflare Tunnel (opt) │
                         └────────────┬────────────┘
                                      │
                              ┌───────▼────────┐
                              │     Caddy      │  TLS, reverse proxy
                              └───┬────────┬───┘
                                  │        │
                   /api, /assistant│        │  / (everything else)
                          ┌────────▼──┐  ┌──▼─────────────┐
                          │  Backend  │  │   Frontend     │
                          │  FastAPI  │  │   Next.js      │
                          └─┬───┬───┬─┘  └────────────────┘
                            │   │   │
              ┌─────────────┘   │   └──────────────┐
        ┌─────▼─────┐    ┌──────▼──────┐    ┌──────▼──────┐
        │ PostgreSQL│    │ S3-compatible│    │   Worker(s) │
        │           │    │   storage    │    │ (jobs/queue)│
        └───────────┘    │ (MinIO etc.) │    └──────┬──────┘
                         └──────────────┘           │
                                          ┌──────────▼──────────┐
                                          │ External services:  │
                                          │ model providers,    │
                                          │ scraper sources,    │
                                          │ SMTP                │
                                          └─────────────────────┘
```

Containers:

- **frontend** — Next.js (React, TypeScript, Tailwind, shadcn/ui). Renders public trees via server components for SEO; client components for the editing experience.
- **backend** — FastAPI (Python). The single source of truth for the domain, auth, the privacy engine, and the assistant's tool surface. Exposes an OpenAPI-described REST API.
- **worker** — same image as backend, run in a worker mode. Handles async jobs: GEDCOM import/export, media processing, scraping, hint matching, scheduled purges.
- **postgres** — relational core, JSONB attributes, `pg_trgm` fuzzy search, `pgvector` for match ranking.
- **objectstore** — S3-compatible; MinIO for self-host, any S3 provider otherwise.
- **caddy** — TLS + reverse proxy. Optional **cloudflared** sidecar for tunnel deployments.

A message queue (e.g., Redis or Postgres-backed) decouples the API from the worker. The choice is deliberately behind an interface so small deployments can use a Postgres-based queue and larger ones can swap in Redis.
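
The interface boundary described above can be sketched as follows. This is a minimal illustration, not the project's actual `Queue` adapter: the `Job` shape, the `enqueue`/`dequeue` method names, and the in-memory backend are all assumptions made for the example.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass(frozen=True)
class Job:
    """A unit of work handed from the API to the worker (hypothetical shape)."""
    kind: str
    payload: dict[str, Any] = field(default_factory=dict)


class Queue(Protocol):
    """The boundary the API and worker code against; backends are swappable."""

    def enqueue(self, job: Job) -> None: ...
    def dequeue(self) -> Job | None: ...


class InMemoryQueue:
    """Stand-in backend for illustration. A Postgres- or Redis-backed class
    would expose the same two methods."""

    def __init__(self) -> None:
        self._jobs: deque[Job] = deque()

    def enqueue(self, job: Job) -> None:
        self._jobs.append(job)

    def dequeue(self) -> Job | None:
        return self._jobs.popleft() if self._jobs else None


def drain(queue: Queue) -> list[str]:
    """Worker-side loop: depends only on the Queue protocol."""
    handled = []
    while (job := queue.dequeue()) is not None:
        handled.append(job.kind)
    return handled
```

Because `drain` only sees the protocol, choosing a different backend would be a matter of constructing a different class at startup.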

## 3. Backend architecture

Layered, with dependencies pointing inward:

- **API layer** — FastAPI routers. Thin: validate input, resolve the caller, delegate to services, serialize output. The auto-generated OpenAPI spec is the contract that both the frontend and the assistant tool layer consume.
- **Service layer** — all domain logic and the only place writes happen. Enforces invariants (e.g., "a write must carry an actor for the audit log"). The privacy engine is invoked here on every read.
- **Repository layer** — data access over SQLAlchemy; no business rules.
- **Domain models** — the entities in §5.
- **Integrations** — adapters behind interfaces: `AuthProvider`, `ObjectStore`, `Mailer`, `LLMProvider` / `EmbeddingProvider` (two separate model abstractions), `SourceConnector`, `Queue`. Swapping an implementation is a config change, not a code change.

Async throughout (FastAPI + async SQLAlchemy). Anything that can be slow or can fail externally (model calls, scraping, large imports) goes to the worker, never inline in a request.
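
As a rough sketch of the service/repository split and the "a write must carry an actor" invariant: the class and method names below are hypothetical, and the code is synchronous for brevity where the real backend is async.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class AuditEntry:
    actor: str
    action: str
    entity_id: int
    at: datetime


@dataclass
class PersonRepository:
    """Data access only: no business rules live here."""
    rows: dict[int, dict[str, Any]] = field(default_factory=dict)

    def save(self, person_id: int, data: dict[str, Any]) -> None:
        self.rows[person_id] = data


@dataclass
class PersonService:
    """Domain logic and the only place writes happen."""
    repo: PersonRepository
    audit_log: list[AuditEntry] = field(default_factory=list)

    def update_person(self, actor: str | None, person_id: int,
                      data: dict[str, Any]) -> None:
        # Invariant from the text: a write must carry an actor for the audit log.
        if not actor:
            raise ValueError("write rejected: no actor for the audit log")
        self.repo.save(person_id, data)
        self.audit_log.append(
            AuditEntry(actor, "person.update", person_id, datetime.now(timezone.utc))
        )
```

A router would only parse the request and call `update_person`; it would never reach the repository directly.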

## 4. Frontend architecture

- **Next.js App Router.** Public, read-only tree/property views are server-rendered for SEO and speed; they call the backend through the same privacy-filtered API used everywhere else.
- **Authoring UI** is client-side, talking to the REST API. A generated TypeScript client keeps it in sync with the OpenAPI contract.
- **Mobile-first.** Layouts are responsive and touch-first; no separate mobile codebase. Feature parity with desktop is a requirement, not a nice-to-have.
- **Design system:** Tailwind + shadcn/ui for a consistent, accessible (WCAG 2.2 AA target) component layer.
- *Status:* the scaffold has landed — Next.js (App Router) + Tailwind + shadcn-style primitives, a typed client generated from the backend OpenAPI spec (`openapi-typescript` + `openapi-fetch`), and auth + tree/person views. Auth rides the same-origin HttpOnly session cookie (Caddy proxies `/api/*` to the backend). Built as a standalone container; Caddy routes `/` to it.

## 5. Data model

Core entities and the important relationships. (Illustrative, not final DDL.)

### Tenancy & identity
- **User** — a person with login. Auth method(s) are attached but identity is internal, so one user can link multiple providers.
- **Tree** — the top-level tenant boundary for genealogical data. Owned by a User; may have additional members. Carries a per-tree **AI model policy** (owner-configured): `ai_member_provider` and `ai_recommender_provider` name configured providers from the model-provider registry (null = no model for that role); the owner may use any configured provider, while these cap what members and the recommender may use. Set via the owner-only `GET`/`PATCH /trees/{id}/ai`.
- **TreeMembership** — (User, Tree, role) where role ∈ {owner, editor, viewer}. The basis for authorization *within a tree*.
- **Instance owner / operator** — orthogonal to tree roles. The account(s) whose email is named in the `OWNER_EMAIL` env var **and whose email is verified** are the instance's operator(s), with access to the owner-only `/api/v1/admin` surface (operational status, instance-wide config). Derived from the env at request time — no DB column, no migration, can't drift, survives DB resets. The verified-email requirement is deliberate: registration is open, so without it whoever registers the owner address first would seize the role — verification ties ownership to proven control of the inbox. Crucially this is **not** a privacy bypass: an instance owner gets operational/config rights, **not** read access to other users' private trees or living-person PII — those still resolve only through the privacy engine. (`is_instance_owner` in `api/deps.py`.)

### Genealogical core
- **Person** — belongs to a Tree. Has many **Name** records (with parts: given, surname, prefix/suffix, and a type such as birth/married/alias) to support variants and changes over time. Carries living/deceased status.
- **Relationship** — typed edge between two Persons within a Tree: parent–child (with a qualifier: biological, adoptive, step, foster, donor, guardian), partnership/marriage (with its own events), sibling (often derived). Modeling parentage as qualified edges — rather than assuming a two-biological-parent nuclear family — is what makes adoption, donor conception, and blended families first-class rather than awkward.
- **Event** — typed (birth, death, marriage, residence, immigration, etc.), with a date (supporting ranges, approximations, and non-Gregorian calendars), an optional Place, and attachable to a Person or a partnership.
- **Place** — a tenant-shared gazetteer entity: hierarchical (place within place), with **historical name variants and date ranges** so a record entered as "Königsberg, 1900" sorts and displays correctly against "Kaliningrad." Optional coordinates. *(Phase 0 scopes Place to a Tree via `tree_id` for absolute tenant isolation; a deployment-wide shared gazetteer is a deliberate later refinement. Variants live in a `PlaceName` child table.)*
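
To illustrate how dated name variants let one Place answer to several names, here is a small sketch. The field names (`valid_from`, `valid_to`) and the year-granularity ranges are assumptions for the example, not the actual `PlaceName` columns.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class PlaceName:
    name: str
    valid_from: int | None = None  # first year the name applies; None = open
    valid_to: int | None = None    # last year the name applies; None = open


@dataclass
class Place:
    id: int
    names: list[PlaceName] = field(default_factory=list)

    def matches(self, query: str) -> bool:
        """True if the query equals any recorded variant, ignoring case."""
        q = query.casefold()
        return any(n.name.casefold() == q for n in self.names)

    def name_in(self, year: int) -> str | None:
        """The variant in force in a given year, if one is recorded."""
        for n in self.names:
            starts = n.valid_from is None or n.valid_from <= year
            ends = n.valid_to is None or year <= n.valid_to
            if starts and ends:
                return n.name
        return None


# Illustrative ranges: the city was renamed in 1946.
konigsberg = Place(1, [
    PlaceName("Königsberg", None, 1945),
    PlaceName("Kaliningrad", 1946, None),
])
```

With this shape, a record entered under either name resolves to the same Place, and display code can pick the variant that fits the event's date.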

### Sources (first-class)
- **Source** — a reusable record of an origin: title, repository, type, optional URL, free citation text, optional quality grade. One Source backs many facts.
- **Citation** — the join between a Source and a specific fact (Event, Name, Relationship, OwnershipEvent), carrying locality (page, line, entry) and confidence. This is the spine of the whole product.
- **Media** — a binary asset (image, scan, PDF, audio) stored in object storage; row holds metadata, checksum, and content type. Attachable to Person, Source, Event, or Property. EXIF location/PII can be stripped on upload.

### Property & land
- **Property** — belongs to a Tree. Has informal names ("the old McGregor place") and a **legal description** stored verbatim (metes-and-bounds prose or PLSS township/range/section), an optional geometry, and a lifespan. Designed so future parcels that split or merge can reference predecessors/successors.
- **OwnershipEvent** — the property analogue of a life event: type (grant/patent, purchase, sale, inheritance, gift, tax sale, foreclosure, eminent domain), date, grantor Person(s), grantee Person(s), optional consideration, and a Citation. Chain-of-title is the ordered set of OwnershipEvents for a Property; "what did this person own" is the reverse traversal.
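
The two traversals mentioned above (chain of title, and "what did this person own") could look roughly like this. The dataclass is a simplified, hypothetical stand-in for the real entity and omits consideration and Citation.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class OwnershipEvent:
    property_id: int
    kind: str                   # e.g. "patent", "sale", "inheritance"
    on: date
    grantors: tuple[int, ...]   # Person ids
    grantees: tuple[int, ...]   # Person ids


def chain_of_title(events: list[OwnershipEvent], property_id: int) -> list[OwnershipEvent]:
    """Ordered OwnershipEvents for one Property."""
    return sorted((e for e in events if e.property_id == property_id),
                  key=lambda e: e.on)


def properties_held_by(events: list[OwnershipEvent], person_id: int) -> set[int]:
    """Reverse traversal: every Property a Person ever received."""
    return {e.property_id for e in events if person_id in e.grantees}
```

In the database both would be indexed queries rather than in-memory filters; the sketch only shows that a single event table serves both directions.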

### Cross-cutting
- **AuditEntry** — append-only: actor (User *or* the assistant principal acting for a User), action, entity, before/after snapshot, timestamp. Immutable.
- **SoftDelete** — entities carry `deleted_at`; a scheduled worker purges rows whose `deleted_at` is more than 30 days old. Recovery means clearing `deleted_at` within that window.
- **ChangeProposal** — a pending set of writes: records an `origin` (`assistant` | `contributor` — collaborator suggestions are encoded today, not just a future idea), a `status` (pending/applied/rejected), a structured `operations` diff (JSONB list of `{op, entity_type, entity_id?, payload}`), a summary/rationale, and review/apply-error metadata. The user approves, edits, or rejects; approved proposals are applied through the normal service layer (so they hit the privacy engine and audit log like any other write). *Note: v1 apply is not cross-op transactional — see `docs/design/change-proposal.md`.*
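
A minimal sketch of the ChangeProposal shape and an op-by-op apply follows. The `write` callback stands in for the normal service-layer write path, and the choice to leave a partially applied proposal `pending` with an error note is an assumption of this example; the real behavior is described in `docs/design/change-proposal.md`.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable

Operation = dict[str, Any]  # {"op", "entity_type", "entity_id"?, "payload"}


@dataclass
class ChangeProposal:
    origin: str                      # "assistant" | "contributor"
    operations: list[Operation]
    summary: str = ""
    status: str = "pending"          # pending | applied | rejected
    apply_error: str | None = None


def apply_proposal(proposal: ChangeProposal, approved_by: str,
                   write: Callable[[str, Operation], None]) -> None:
    """Apply an approved proposal by routing each op through `write`."""
    if proposal.status != "pending":
        raise ValueError(f"cannot apply a {proposal.status} proposal")
    for op in proposal.operations:
        try:
            write(approved_by, op)
        except Exception as exc:
            # Not cross-op transactional: earlier ops stay applied.
            proposal.apply_error = f"{op['op']} {op['entity_type']}: {exc}"
            return
    proposal.status = "applied"
```

Passing `approved_by` into every write keeps the audit-log actor attached to each operation rather than to the proposal as a whole.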

## 6. Privacy engine

Conceptually, a single function:

```
visible(viewer, entity) -> { full | redacted | hidden }
```

Inputs: viewer's role on the entity's Tree (including "anonymous"), the Tree's visibility (public / site_members / unlisted / private), per-Person privacy override, and living-person status.

Rules:
- **Tree private** → only members see anything.
- **Tree site_members** → any authenticated account on this instance gets a read view (anonymous viewers get nothing), still per-person living/override filtered.
- **Tree unlisted / public** → non-members *including anonymous viewers* get a read view, *but* every Person is run through the living-person check and per-person override first. Unlisted is gated only by knowing the link (never listed or search-indexed); public is listed in `/explore` and indexable.
- **Living-person rule** — a Person with no death fact, whose birth is within a configurable recency window (default ~100 years; unknown birth treated as possibly-living), is redacted (name minimized, vitals/events/media hidden) for non-owners. Owners may override per Person.
- The engine is invoked in the **service layer**, so it covers API, server-rendered public pages, search results, and any data the assistant can read. There is intentionally no path that returns rows without passing through it.
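
The rules above can be sketched as one pure function. This is an illustration under stated assumptions, not the engine itself: the `Viewer` and `Person` shapes are hypothetical, the per-person override is modeled as a fixed outcome, the current year is passed in explicitly, and "non-owners" is read literally, so editors and viewers also get the redacted view of living people.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Visibility(Enum):
    FULL = "full"
    REDACTED = "redacted"
    HIDDEN = "hidden"


@dataclass(frozen=True)
class Viewer:
    role: str | None = None       # "owner" | "editor" | "viewer" | None (non-member)
    authenticated: bool = False


@dataclass(frozen=True)
class Person:
    tree_visibility: str          # "public" | "site_members" | "unlisted" | "private"
    birth_year: int | None = None
    has_death_fact: bool = False
    override: Visibility | None = None   # owner-set per-person override


def presumed_living(p: Person, current_year: int, window: int = 100) -> bool:
    if p.has_death_fact:
        return False
    # Unknown birth is treated as possibly living.
    return p.birth_year is None or current_year - p.birth_year < window


def visible(viewer: Viewer, p: Person, current_year: int) -> Visibility:
    if viewer.role == "owner":
        return Visibility.FULL
    # Non-members are gated on the tree's visibility first.
    if viewer.role is None:
        if p.tree_visibility == "private":
            return Visibility.HIDDEN
        if p.tree_visibility == "site_members" and not viewer.authenticated:
            return Visibility.HIDDEN
    if p.override is not None:
        return p.override
    return Visibility.REDACTED if presumed_living(p, current_year) else Visibility.FULL
```

Keeping the decision free of I/O makes it straightforward to call from every read path and to test exhaustively.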

## 7. AI research assistant

Three parts, deliberately separated:

1. **Model provider abstraction** — two separate interfaces, `LLMProvider` and `EmbeddingProvider` (configured independently, since e.g. Anthropic has no embeddings endpoint), over hosted models (Anthropic, OpenAI, xAI) and self-hosted/local models via an OpenAI-compatible endpoint or Ollama. An operator can configure **several providers at once** through a registry (`build_llm_providers()`/`configured_llm_providers()`), each selectable by name; this is the basis for the per-tree AI policy and the `default_llm_provider`/`default_embedding_provider` settings. Keys are supplied by the operator (this deployment) or by the user (BYO-key deployments).
2. **Scoped tool surface** — the assistant can only act through a constrained set of tools that map to service-layer operations, **scoped to the user it is helping.** It is its own principal: it cannot exceed that user's rights, and every action is attributed to "assistant (on behalf of User X)" in the audit log. This is the MCP-style boundary referenced in the PRD — the assistant gets capabilities, not raw database access.
3. **Source connectors** (`SourceConnector`) — a plugin framework for *reading* external data: FamilySearch API, Find A Grave, WikiTree, BLM/GLO land patents, USGS maps, public-domain newspapers, public county records. Only legally permissible sources ship with the project; operators can add their own. Connectors are read-only and rate-limited, and run in the worker.

**Propose-then-confirm is structural, not a policy.** The assistant's "write" tools don't write — they emit a `ChangeProposal`. The only thing that applies a proposal is a user action in the UI. There is no code path where a model response mutates the tree directly. This makes hallucinations a review annoyance, never a data-integrity event.
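
One way to picture that structural boundary: the tool object handed to the model holds a reference to the proposal list only, so it has no route to the tree data. All names here (`AssistantTools`, `propose_person_update`, `approve`) are hypothetical and chosen for the sketch.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class Proposal:
    on_behalf_of: str
    operations: list[dict[str, Any]]
    status: str = "pending"


@dataclass
class Tree:
    people: dict[int, dict[str, Any]] = field(default_factory=dict)
    proposals: list[Proposal] = field(default_factory=list)


class AssistantTools:
    """The only surface the model can call. It keeps no reference to
    `people`; its 'write' tool only queues a proposal."""

    def __init__(self, tree: Tree, acting_user: str) -> None:
        self._proposals = tree.proposals
        self._acting_user = acting_user

    def propose_person_update(self, person_id: int, changes: dict[str, Any]) -> Proposal:
        proposal = Proposal(
            on_behalf_of=self._acting_user,
            operations=[{"op": "update", "entity_type": "person",
                         "entity_id": person_id, "payload": changes}],
        )
        self._proposals.append(proposal)
        return proposal


def approve(tree: Tree, proposal: Proposal) -> None:
    """Called from a user action in the UI: the only function here that
    touches `people`."""
    for op in proposal.operations:
        tree.people.setdefault(op["entity_id"], {}).update(op["payload"])
    proposal.status = "applied"
```

Whatever the model returns, the worst outcome in this shape is an extra pending proposal for the user to reject.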

## 8. Hint & matching engine

- Runs entirely in the **worker**, asynchronously, never in a user request.
- **Candidate generation:** blocking on normalized name + key vitals + place to find plausible same-individual pairs across Trees.
- **Ranking:** `pgvector` embeddings over a person's salient features reduce false positives beyond what string similarity alone gives.
- **Privacy-preserving by construction:** the notification a user receives describes *that* a possible match exists, with zero identifying detail about the other user or any living person. Identities and data are exchanged only after **mutual opt-in**, after which the messaging channel opens. The matching threat model (what a malicious user could infer) must be documented before this ships.
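
The blocking and ranking steps can be sketched in plain Python. The block key (accent-folded surname plus birth decade), the 0.8 threshold, and the toy two-dimensional embeddings are all assumptions for illustration; in the real system the vectors would live in `pgvector` and the similarity would be computed in the database.

```python
from __future__ import annotations

import math
import unicodedata
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Candidate:
    person_id: int
    tree_id: int
    surname: str
    birth_year: int
    embedding: tuple[float, ...]


def block_key(c: Candidate) -> tuple[str, int]:
    """Normalized surname + birth decade; a deliberately coarse key."""
    folded = unicodedata.normalize("NFKD", c.surname)
    ascii_name = "".join(ch for ch in folded if not unicodedata.combining(ch))
    return ascii_name.casefold(), c.birth_year // 10


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def ranked_pairs(people: list[Candidate], threshold: float = 0.8):
    """Compare only within a block, only across trees, then rank by score."""
    blocks: dict[tuple[str, int], list[Candidate]] = defaultdict(list)
    for p in people:
        blocks[block_key(p)].append(p)
    pairs = []
    for members in blocks.values():
        for a, b in combinations(members, 2):
            if a.tree_id == b.tree_id:
                continue  # only cross-tree matches are hints
            score = cosine(a.embedding, b.embedding)
            if score >= threshold:
                pairs.append((score, a.person_id, b.person_id))
    return sorted(pairs, reverse=True)
```

Blocking keeps the number of comparisons manageable; the embedding score then filters out pairs that share a key but little else.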

## 9. Authentication

- `AuthProvider` interface with implementations for **local** (password + email verification/reset), **OIDC** (validated against Authentik; expected to work with Keycloak, Auth0, etc.), and **social** (Google, Apple, Facebook).
- Operators enable any subset via config. This deployment will use Authentik (`auth.jpaul.io`) plus selected social providers; a bare self-hoster can run local-only.
- Sessions are backend-issued; the assistant principal is minted per-session and scoped to the acting user.
- *Status:* **local auth has landed** — Argon2id password hashing, opaque backend-issued sessions (only the token hash is stored; presented as a Bearer token or HttpOnly cookie), and email verification + password reset via the `Mailer` interface (console in dev, SMTP for operators). An opt-in gate (`REQUIRE_EMAIL_VERIFICATION`, default off so SMTP-less self-hosts and pre-existing accounts aren't locked out) refuses sessions for accounts without a verified email — login is denied and existing sessions stop resolving until the address is verified. OIDC and social providers are Phase 5. Every write records an attributable actor in the audit log.
- **Instance owner / operator** (orthogonal to the per-tree roles): the account(s) whose email is in `OWNER_EMAIL` *and* is verified are the instance operator(s), with the owner-only `/api/v1/admin` surface (operational status, instance-wide config). Derived from the env at request time — no DB column. It is an operator/config role, **not** a privacy bypass: it grants no read access to other users' private trees or living-person PII. (`is_instance_owner` in `api/deps.py`.)
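
The instance-owner derivation might look something like the sketch below. It is not the code in `api/deps.py`: the comma-separated format for `OWNER_EMAIL`, the case-insensitive comparison, and the `User` shape are assumptions made for the example.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class User:
    email: str
    email_verified: bool


def owner_emails(env: Mapping[str, str] | None = None) -> set[str]:
    """Read OWNER_EMAIL at call time (assumed comma-separated), so there
    is no stored flag that could drift from the environment."""
    raw = (env if env is not None else os.environ).get("OWNER_EMAIL", "")
    return {e.strip().casefold() for e in raw.split(",") if e.strip()}


def is_instance_owner(user: User, env: Mapping[str, str] | None = None) -> bool:
    # Both conditions are required: a matching address alone is not enough,
    # because registration is open.
    return user.email_verified and user.email.casefold() in owner_emails(env)
```

The check grants access to the admin surface only; it is not consulted by the privacy engine.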

## 10. Search

- Postgres `pg_trgm` for fuzzy name matching (Mueller/Müller/Muller), plus date and place filters.
- All queries are filtered through the privacy engine; a non-owner can never surface a redacted living person via search.
- Place normalization uses the gazetteer's historical variants so a search hits regardless of which era's name was recorded.
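
To show why trigram matching tolerates spelling variants, here is a pure-Python approximation of the similarity `pg_trgm` computes. It is a simplification (it splits on whitespace only and does not reproduce every detail of the extension), intended for intuition rather than as a replacement for the database function.

```python
def trigrams(text: str) -> set[str]:
    """Approximate pg_trgm: lowercase, pad each word with two leading
    spaces and one trailing space, then take every 3-character window."""
    grams: set[str] = set()
    for word in text.lower().split():
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by total distinct trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


print(similarity("Muller", "Mueller"))  # 0.5
print(similarity("Müller", "Muller"))   # 0.4
```

Both scores clear the default `pg_trgm` similarity threshold of 0.3, which is why all three spellings can surface for one query.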

## 11. Background jobs

Worker-handled, queue-driven:
- GEDCOM import (with mapping report) and export
- Media processing (thumbnails, EXIF stripping, checksums)
- Source-connector scraping
- Hint candidate generation + ranking
- Scheduled soft-delete purge (30-day window)
- Outbound email via SMTP

Jobs are idempotent and retryable; an external failure degrades gracefully rather than corrupting state.
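
A minimal sketch of what "idempotent and retryable" can mean in practice. The `completed` set stands in for a durable record of finished jobs, and `TransientError`, `run_job`, and the attempt limit are hypothetical names and values for this example.

```python
from __future__ import annotations

from typing import Callable


class TransientError(Exception):
    """An external failure worth retrying (timeout, 5xx, rate limit)."""


def run_job(job_id: str, work: Callable[[], None], completed: set[str],
            max_attempts: int = 3) -> str:
    """Run `work` to completion at most once per job_id."""
    if job_id in completed:
        return "skipped"      # redelivery of a finished job is a no-op
    for _ in range(max_attempts):
        try:
            work()
        except TransientError:
            continue          # a real worker would back off before retrying
        completed.add(job_id)
        return "done"
    return "failed"           # nothing recorded; the job can be re-queued
```

The job is only marked complete after `work` returns, so a crash or exhausted retries leaves it eligible to run again.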

## 12. Deployment & CI/CD

- **Images** are built by **Gitea Actions** (`runs-on: docker`) and pushed to the **Gitea container registry**, one package per component (`provenance-backend`, `provenance-frontend`) linked to the repo.
- **Split push/pull endpoints** (mirrors the drawbar setup): CI **pushes** to the LAN registry endpoint `192.168.0.2:1234` over plain HTTP (buildx configured `insecure`/`http`) to bypass the Cloudflare request-body limit; servers **pull** from the public `git.jpaul.io` FQDN (TLS via Cloudflare). Same Gitea registry, two front doors. Auth uses the `REGISTRY_TOKEN` Actions secret.
- Tag scheme: `test-main` (current main), `test-sha-<long>` (rollback pins), the component version, and `latest` on `v*` tags.
- Servers **pull** new images to deploy — no build on the host. The deploy compose references `git.jpaul.io/justin/provenance-{backend,frontend}:${IMAGE_TAG:-test-main}`; `docker-compose.dev.yml` is a local-build override.
- **Caddy** terminates TLS and reverse-proxies frontend + backend. **Cloudflare Tunnel** is the preferred ingress (no open inbound ports) but is never required; a plain Caddy-on-a-public-host deployment is equally supported.
- **Configuration** is entirely environment-driven (twelve-factor). One `.env` plus the compose file is enough to stand up a deployment; the backend/worker/migrate services read it via `env_file`, so every setting in `app/core/config.py` is configurable without a compose edit.
- **Migrations** run on backend start (`RUN_MIGRATIONS=1`) and via a one-shot `migrate` compose service, so an image pull + restart is a complete upgrade. A **schema-drift guard** (defense in depth) makes a half-applied deploy loud rather than a silent storm of 500s: `/health/ready` returns 503 and startup logs a CRITICAL `SCHEMA DRIFT` line when the DB's `alembic_version` is behind the heads baked into the image (`app/core/schema_version.py`).
- **Backups:** a one-command operator script (`deploy/backup.sh` — `pg_dump` + MinIO object sync, see `deploy/BACKUP.md`) plus a per-account ZIP export; restore is the inverse.
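
The schema-drift guard can be reduced to a small decision function. This sketch is a simplification of whatever `app/core/schema_version.py` does: it flags any image head missing from the database's recorded versions, whereas telling "behind" from "ahead" precisely would require walking the Alembic revision graph. Function names and return shape are assumptions.

```python
def schema_is_current(db_versions: set[str], image_heads: set[str]) -> bool:
    """True when every head baked into the image is recorded in the DB."""
    return image_heads <= db_versions


def readiness(db_reachable: bool, db_versions: set[str],
              image_heads: set[str]) -> tuple[int, str]:
    """Status code and reason for a /health/ready style probe."""
    if not db_reachable:
        return 503, "database unreachable"
    if not schema_is_current(db_versions, image_heads):
        return 503, "SCHEMA DRIFT"
    return 200, "ready"
```

Failing readiness instead of serving requests means the proxy or orchestrator sees the problem immediately.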

**Repository layout (as scaffolded):**

```
/backend           # FastAPI, uv-managed. app/{api/v1, services (+privacy), repositories, models, schemas, integrations (auth, mailer, objectstore, models = LLM/embedding providers), core}; migrations/ = Alembic
/deploy            # docker-compose.yml (+ docker-compose.dev.yml), Caddyfile, .env.example, backup.sh + BACKUP.md
/.gitea/workflows  # Gitea Actions: build images → Gitea registry
/frontend          # Next.js (App Router, TS, Tailwind). app/ pages, lib/api (openapi-typescript client), components/ui, Dockerfile (standalone)
```

The compose stack runs `postgres` (the pgvector image, which includes `pgvector`; `pg_trgm` ships in contrib), `minio`, a one-shot `migrate` job, `backend`, the **worker** (same image as backend in worker mode; it runs the scheduled soft-delete purge), `frontend`, `caddy`, and an optional `cloudflared` tunnel. The backend exposes `/health` (liveness) and `/health/ready` (Postgres reachability + schema-drift check).

## 13. Observability

- Structured (JSON) logs from backend and worker.
- Health/readiness endpoints for the proxy and orchestrator.
- Metrics endpoint (Prometheus-style) for queue depth, job outcomes, request latency, and model/connector call results.

## 14. Open architectural questions

Tracked alongside the PRD's open questions:
- **Queue backend default:** Postgres-based for zero-extra-service self-hosting vs. Redis for throughput — likely ship the Postgres option first, Redis as an opt-in.
- **Embeddings source:** hosted provider vs. local model for the hint engine; and whether early phases even need embeddings or can ship with fuzzy+rules first.
- **Geometry stack:** plain coordinates initially vs. PostGIS for real parcel geometry later (PostGIS is a bigger self-host ask, so it stays optional/deferred).
- **GEDCOM extension fidelity:** how aggressively to preserve vendor extension tags on round-trip.