justin 03124027fe Record Phase 0 repo layout in CLAUDE.md and ARCHITECTURE
Documents the scaffolded tree (/backend, /deploy, /.gitea, pending /frontend), the deploy-first sequencing, and the toolchain choices (uv for backend deps, Alembic for migrations), as CLAUDE.md's layout section requires when code lands.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Justin Paul <justin@jpaul.me>
2026-06-06 10:17:12 -04:00

# Provenance — Architecture
**Status:** Draft v0.1
**Companion to:** [docs/PRD.md](PRD.md)
**Last updated:** 2026-06-06
This document describes *how* Provenance is built. The PRD describes *what* and *why*. Where the two disagree, the PRD wins on intent and this document is updated to match.
---
## 1. Guiding constraints
These come straight from the PRD and shape every decision below:
1. **Self-hostable and container-native.** Everything runs in containers, configured by environment. A single `compose` stack stands up the whole system.
2. **Multi-tenant.** Many users and many trees coexist on one deployment with hard data isolation between tenants.
3. **Privacy has no bypass.** Every read path — UI, API, public link, AI assistant, search, hints — resolves visibility through one engine. There is no "admin shortcut" that returns raw rows.
4. **The AI assistant never writes autonomously.** It is a scoped principal that can only *propose* changes; persistence requires explicit user approval.
5. **Sources are first-class.** The schema makes a fact-without-a-source possible but a fact-with-a-source cheap and visible.
6. **Configurable integrations.** Auth, mail, object storage, model providers, and scrapers are all pluggable so a third party could run their own deployment.
## 2. System topology
```
                  ┌─────────────────────────┐
  Internet ─────► │ Cloudflare Tunnel (opt) │
                  └────────────┬────────────┘
                       ┌───────▼────────┐
                       │     Caddy      │  TLS, reverse proxy
                       └───┬────────┬───┘
                           │        │
           /api, /assistant│        │ / (everything else)
                  ┌────────▼──┐  ┌──▼─────────────┐
                  │  Backend  │  │    Frontend    │
                  │  FastAPI  │  │    Next.js     │
                  └─┬───┬───┬─┘  └────────────────┘
                    │   │   │
      ┌─────────────┘   │   └──────────────┐
┌─────▼─────┐    ┌──────▼───────┐   ┌──────▼──────┐
│ PostgreSQL│    │ S3-compatible│   │  Worker(s)  │
│           │    │   storage    │   │ (jobs/queue)│
└───────────┘    │ (MinIO etc.) │   └──────┬──────┘
                 └──────────────┘          │
                                ┌──────────▼──────────┐
                                │ External services:  │
                                │  model providers,   │
                                │  scraper sources,   │
                                │  SMTP               │
                                └─────────────────────┘
```
Containers:
- **frontend** — Next.js (React, TypeScript, Tailwind, shadcn/ui). Renders public trees via server components for SEO; client components for the editing experience.
- **backend** — FastAPI (Python). The single source of truth for the domain, auth, the privacy engine, and the assistant's tool surface. Exposes an OpenAPI-described REST API.
- **worker** — same image as backend, run in a worker mode. Handles async jobs: GEDCOM import/export, media processing, scraping, hint matching, scheduled purges.
- **postgres** — relational core, JSONB attributes, `pg_trgm` fuzzy search, `pgvector` for match ranking.
- **objectstore** — S3-compatible; MinIO for self-host, any S3 provider otherwise.
- **caddy** — TLS + reverse proxy. Optional **cloudflared** sidecar for tunnel deployments.
A message queue (e.g., Redis or Postgres-backed) decouples the API from the worker. The choice is deliberately behind an interface so small deployments can use a Postgres-based queue and larger ones can swap in Redis.
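
A minimal sketch of that seam, assuming a hypothetical two-method `Queue` protocol. The names `InMemoryQueue` and `make_queue`, and the `"memory"` config value, are illustrative only; a Postgres- or Redis-backed adapter would expose the same methods.

```python
from collections import deque
from typing import Optional, Protocol


class Queue(Protocol):
    """Hypothetical interface the API and the worker would both depend on."""

    def enqueue(self, job_type: str, payload: dict) -> None: ...
    def dequeue(self) -> Optional[tuple[str, dict]]: ...


class InMemoryQueue:
    """Stand-in backend for illustration; not a production queue."""

    def __init__(self) -> None:
        self._jobs: deque[tuple[str, dict]] = deque()

    def enqueue(self, job_type: str, payload: dict) -> None:
        self._jobs.append((job_type, payload))

    def dequeue(self) -> Optional[tuple[str, dict]]:
        return self._jobs.popleft() if self._jobs else None


def make_queue(backend: str) -> Queue:
    """Pick an implementation from configuration (value name is assumed)."""
    if backend == "memory":
        return InMemoryQueue()
    raise ValueError(f"unsupported queue backend: {backend!r}")


queue = make_queue("memory")
queue.enqueue("gedcom_import", {"tree_id": 1})
job = queue.dequeue()
```

Callers only ever see `Queue`, so swapping the backend would touch `make_queue` and configuration, not the API or worker code.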
## 3. Backend architecture
Layered, with dependencies pointing inward:
- **API layer** — FastAPI routers. Thin: validate input, resolve the caller, delegate to services, serialize output. Auto-generated OpenAPI spec is the contract the frontend and the assistant tool layer both consume.
- **Service layer** — all domain logic and the only place writes happen. Enforces invariants (e.g., "a write must carry an actor for the audit log"). The privacy engine is invoked here on every read.
- **Repository layer** — data access over SQLAlchemy; no business rules.
- **Domain models** — the entities in §5.
- **Integrations** — adapters behind interfaces: `AuthProvider`, `ObjectStore`, `Mailer`, `ModelProvider`, `SourceConnector`, `Queue`. Swapping an implementation is a config change, not a code change.
Async throughout (FastAPI + async SQLAlchemy). Anything that can be slow or can fail externally (model calls, scraping, large imports) goes to the worker, never inline in a request.
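
The layering can be sketched as follows, assuming an in-memory repository in place of SQLAlchemy. `InMemoryPersonRepository`, `PersonService`, and `MissingActorError` are hypothetical names; the point is only that the repository holds no rules and the service rejects a write that carries no actor.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Optional


class MissingActorError(Exception):
    """Raised when a write arrives without an actor for the audit log."""


@dataclass
class InMemoryPersonRepository:
    """Repository layer: data access only, no business rules."""

    rows: dict[int, dict] = field(default_factory=dict)

    async def insert(self, person_id: int, data: dict) -> None:
        self.rows[person_id] = data


@dataclass
class PersonService:
    """Service layer: the only place writes happen."""

    repo: InMemoryPersonRepository
    audit_log: list[dict] = field(default_factory=list)

    async def create_person(self, actor: Optional[str], person_id: int, data: dict) -> None:
        if not actor:
            # Invariant from the text: a write must carry an actor.
            raise MissingActorError("a write must carry an actor")
        await self.repo.insert(person_id, data)
        self.audit_log.append({"actor": actor, "action": "create", "entity": person_id})


service = PersonService(repo=InMemoryPersonRepository())
asyncio.run(service.create_person("user:1", 42, {"given": "Ada"}))
```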
## 4. Frontend architecture
- **Next.js App Router.** Public, read-only tree/property views are server-rendered for SEO and speed; they call the backend through the same privacy-filtered API used everywhere else.
- **Authoring UI** is client-side, talking to the REST API. A generated TypeScript client keeps it in sync with the OpenAPI contract.
- **Mobile-first.** Layouts are responsive and touch-first; no separate mobile codebase. Feature parity with desktop is a requirement, not a nice-to-have.
- **Design system:** Tailwind + shadcn/ui for a consistent, accessible (WCAG 2.2 AA target) component layer.
## 5. Data model
Core entities and the important relationships. (Illustrative, not final DDL.)
### Tenancy & identity
- **User** — a person with login. Auth method(s) are attached but identity is internal, so one user can link multiple providers.
- **Tree** — the top-level tenant boundary for genealogical data. Owned by a User; may have additional members.
- **TreeMembership** — (User, Tree, role) where role ∈ {owner, editor, viewer}. The basis for authorization.
### Genealogical core
- **Person** — belongs to a Tree. Has many **Name** records (with parts: given, surname, prefix/suffix, and a type such as birth/married/alias) to support variants and changes over time. Carries living/deceased status.
- **Relationship** — typed edge between two Persons within a Tree: parent-child (with a qualifier: biological, adoptive, step, foster, donor, guardian), partnership/marriage (with its own events), sibling (often derived). Modeling parentage as qualified edges, rather than assuming a two-biological-parent nuclear family, is what makes adoption, donor conception, and blended families first-class rather than awkward.
- **Event** — typed (birth, death, marriage, residence, immigration, etc.), with a date (supporting ranges, approximations, and non-Gregorian calendars), an optional Place, and attachable to a Person or a partnership.
- **Place** — a tenant-shared gazetteer entity: hierarchical (place within place), with **historical name variants and date ranges** so a record entered as "Königsberg, 1900" sorts and displays correctly against "Kaliningrad." Optional coordinates.
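
As a rough illustration of the Place entity's dated name variants, the sketch below assumes year-granularity ranges and hypothetical `PlaceName` / `Place` classes. The real schema is not final, and the year boundary in the example is illustrative.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PlaceName:
    name: str
    valid_from: Optional[int] = None  # first year the name applies; None = open-ended
    valid_to: Optional[int] = None    # last year the name applies; None = open-ended


@dataclass
class Place:
    canonical_name: str
    variants: list[PlaceName] = field(default_factory=list)

    def name_in(self, year: int) -> str:
        """Return the variant whose range covers `year`, else the canonical name."""
        for v in self.variants:
            starts = v.valid_from is None or v.valid_from <= year
            ends = v.valid_to is None or year <= v.valid_to
            if starts and ends:
                return v.name
        return self.canonical_name

    def matches(self, query: str) -> bool:
        """True if the query equals the canonical name or any variant."""
        names = {self.canonical_name, *(v.name for v in self.variants)}
        return query.casefold() in {n.casefold() for n in names}


place = Place(
    "Kaliningrad",
    variants=[
        PlaceName("Königsberg", valid_to=1945),
        PlaceName("Kaliningrad", valid_from=1946),
    ],
)
```

With this shape, `place.name_in(1900)` yields the period name while `place.matches(...)` accepts either era's spelling.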
### Sources (first-class)
- **Source** — a reusable record of an origin: title, repository, type, optional URL, free citation text, optional quality grade. One Source backs many facts.
- **Citation** — the join between a Source and a specific fact (Event, Name, Relationship, OwnershipEvent), carrying locality (page, line, entry) and confidence. This is the spine of the whole product.
- **Media** — a binary asset (image, scan, PDF, audio) stored in object storage; row holds metadata, checksum, and content type. Attachable to Person, Source, Event, or Property. EXIF location/PII can be stripped on upload.
### Property & land
- **Property** — belongs to a Tree. Has informal names ("the old McGregor place") and a **legal description** stored verbatim (metes-and-bounds prose or PLSS township/range/section), an optional geometry, and a lifespan. Designed so future parcels that split or merge can reference predecessors/successors.
- **OwnershipEvent** — the property analogue of a life event: type (grant/patent, purchase, sale, inheritance, gift, tax sale, foreclosure, eminent domain), date, grantor Person(s), grantee Person(s), optional consideration, and a Citation. Chain-of-title is the ordered set of OwnershipEvents for a Property; "what did this person own" is the reverse traversal.
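
Chain-of-title and its reverse traversal can be sketched like this. The field names and the sample events are made up for illustration, and Persons are plain strings rather than entities.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class OwnershipEvent:
    property_id: int
    kind: str                  # e.g. "patent", "sale", "inheritance"
    on: date
    grantors: tuple[str, ...]  # Person identifiers (strings here for brevity)
    grantees: tuple[str, ...]


def chain_of_title(events: list[OwnershipEvent], property_id: int) -> list[OwnershipEvent]:
    """Ordered OwnershipEvents for one Property."""
    return sorted((e for e in events if e.property_id == property_id), key=lambda e: e.on)


def holdings(events: list[OwnershipEvent], person: str) -> list[int]:
    """Reverse traversal: ids of properties the person ever received."""
    return sorted({e.property_id for e in events if person in e.grantees})


events = [
    OwnershipEvent(1, "sale", date(1902, 5, 1), ("a.mcgregor",), ("b.hale",)),
    OwnershipEvent(1, "patent", date(1871, 3, 9), (), ("a.mcgregor",)),
    OwnershipEvent(2, "inheritance", date(1910, 1, 4), ("c.hale",), ("b.hale",)),
]
```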
### Cross-cutting
- **AuditEntry** — append-only: actor (User *or* the assistant principal acting for a User), action, entity, before/after snapshot, timestamp. Immutable.
- **SoftDelete** — entities carry `deleted_at`; a scheduled worker purges rows older than 30 days. Recovery = clearing `deleted_at` within the window.
- **ChangeProposal** — a pending set of writes generated by the assistant (or potentially a collaborator suggestion later): a structured diff the user approves, edits, or rejects. Approved proposals are applied through the normal service layer (so they hit the privacy engine and the audit log like any other write).
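
The soft-delete window can be expressed as two small predicates. This is a sketch over plain dicts, with hypothetical helper names, and it assumes timezone-aware timestamps.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

PURGE_AFTER = timedelta(days=30)


def is_purgeable(deleted_at: Optional[datetime], now: datetime) -> bool:
    """True once a soft-deleted row has aged out of the recovery window."""
    return deleted_at is not None and now - deleted_at > PURGE_AFTER


def restore(row: dict, now: datetime) -> bool:
    """Recovery clears deleted_at, but only inside the window."""
    if row["deleted_at"] is None or is_purgeable(row["deleted_at"], now):
        return False
    row["deleted_at"] = None
    return True


now = datetime(2026, 6, 6, tzinfo=timezone.utc)
recent = {"id": 1, "deleted_at": now - timedelta(days=3)}
stale = {"id": 2, "deleted_at": now - timedelta(days=31)}
```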
## 6. Privacy engine
Conceptually, a single function:
```
visible(viewer, entity) -> { full | redacted | hidden }
```
Inputs: viewer's role on the entity's Tree (including "anonymous"), the Tree's visibility (public/unlisted/private), per-Person privacy override, and living-person status.
Rules:
- **Tree private** → only members see anything.
- **Tree public/unlisted** → non-members get a read view, *but* every Person is run through the living-person check and per-person override first.
- **Living-person rule** — a Person with no death fact, whose birth is within a configurable recency window (default ~100 years; unknown birth treated as possibly-living), is redacted (name minimized, vitals/events/media hidden) for non-owners. Owners may override per Person.
- The engine is invoked in the **service layer**, so it covers API, server-rendered public pages, search results, and any data the assistant can read. There is intentionally no path that returns rows without passing through it.
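
One possible reading of these rules as code, offered as a sketch rather than the engine itself. It assumes the per-Person override only ever reveals (the `owner_override_public` flag is hypothetical), that only owners bypass the living-person rule, and that dates are compared by year.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Visibility(Enum):
    FULL = "full"
    REDACTED = "redacted"
    HIDDEN = "hidden"


@dataclass
class Person:
    birth_year: Optional[int]
    has_death_fact: bool
    owner_override_public: bool = False  # hypothetical per-Person override


def possibly_living(p: Person, current_year: int, window: int = 100) -> bool:
    if p.has_death_fact:
        return False
    # Unknown birth is treated as possibly-living.
    return p.birth_year is None or current_year - p.birth_year <= window


def visible(role: Optional[str], tree_visibility: str, p: Person, current_year: int) -> Visibility:
    """`role` is "owner", "editor", "viewer", or None for a non-member."""
    if role == "owner":
        return Visibility.FULL
    if tree_visibility == "private" and role is None:
        return Visibility.HIDDEN
    if possibly_living(p, current_year) and not p.owner_override_public:
        return Visibility.REDACTED
    return Visibility.FULL
```

Because the function takes only the viewer's role, the Tree's visibility, and the Person, every read path can call it the same way.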
## 7. AI research assistant
Three parts, deliberately separated:
1. **Model provider abstraction** (`ModelProvider`) — one interface over hosted models (Anthropic, OpenAI, xAI) and self-hosted/local models via an OpenAI-compatible endpoint or Ollama. Configurable per deployment; keys supplied by the operator (this deployment) or by the user (BYO-key deployments).
2. **Scoped tool surface** — the assistant can only act through a constrained set of tools that map to service-layer operations, **scoped to the user it is helping.** It is its own principal: it cannot exceed that user's rights, and every action is attributed to "assistant (on behalf of User X)" in the audit log. This is the MCP-style boundary referenced in the PRD — the assistant gets capabilities, not raw database access.
3. **Source connectors** (`SourceConnector`) — a plugin framework for *reading* external data: FamilySearch API, Find A Grave, WikiTree, BLM/GLO land patents, USGS maps, public-domain newspapers, public county records. Only legally permissible sources ship with the project; operators can add their own. Connectors are read-only and rate-limited, and run in the worker.
**Propose-then-confirm is structural, not a policy.** The assistant's "write" tools don't write — they emit a `ChangeProposal`. The only thing that applies a proposal is a user action in the UI. There is no code path where a model response mutates the tree directly. This makes hallucinations a review annoyance, never a data-integrity event.
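
A minimal sketch of that structure, with hypothetical names throughout (`assistant_propose_rename`, `approve`, and the dict-based tree). How the audit entry attributes an approved proposal is an assumption here.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeProposal:
    on_behalf_of: str
    changes: list[dict]
    status: str = "pending"  # pending -> approved | rejected


@dataclass
class Tree:
    people: dict[int, dict] = field(default_factory=dict)
    audit_log: list[dict] = field(default_factory=list)


def assistant_propose_rename(user: str, person_id: int, new_name: str) -> ChangeProposal:
    """The assistant's "write" tool: returns a proposal and touches no data."""
    return ChangeProposal(user, [{"person_id": person_id, "set": {"name": new_name}}])


def approve(tree: Tree, proposal: ChangeProposal, approving_user: str) -> None:
    """Only a user action applies a proposal; each change is audited."""
    if proposal.status != "pending":
        raise ValueError("proposal already resolved")
    for change in proposal.changes:
        tree.people[change["person_id"]].update(change["set"])
        # Attribution format is assumed for this sketch.
        tree.audit_log.append({"actor": approving_user, "via": "assistant proposal", **change})
    proposal.status = "approved"


tree = Tree(people={7: {"name": "J. Smith"}})
proposal = assistant_propose_rename("user:1", 7, "John Smith")
name_before_approval = tree.people[7]["name"]  # still the original value
approve(tree, proposal, "user:1")
```

The assistant-facing function has no reference to `Tree` at all, which is the structural property the text describes.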
## 8. Hint & matching engine
- Runs entirely in the **worker**, asynchronously, never in a user request.
- **Candidate generation:** blocking on normalized name + key vitals + place to find plausible same-individual pairs across Trees.
- **Ranking:** `pgvector` embeddings over a person's salient features reduce false positives beyond what string similarity alone gives.
- **Privacy-preserving by construction:** the notification a user receives describes *that* a possible match exists, with zero identifying detail about the other user or any living person. Identities and data are exchanged only after **mutual opt-in**, after which the messaging channel opens. The matching threat model (what a malicious user could infer) is documented before this ships.
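
The candidate-generation step can be illustrated with a simple blocking pass. The choice of key (normalized surname, birth decade, normalized place) is an assumption for this sketch, and the sample records are invented.

```python
import unicodedata
from collections import defaultdict
from itertools import combinations


def normalize(text: str) -> str:
    """Lowercase and strip diacritics so Müller and Muller share a key."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()


def blocking_key(person: dict) -> tuple:
    # Surname + birth decade + place; a real key would be tuned on data.
    return (normalize(person["surname"]), person["birth_year"] // 10, normalize(person["place"]))


def candidate_pairs(people: list[dict]) -> list[tuple[int, int]]:
    """Pair up people that share a block and sit in different Trees."""
    blocks = defaultdict(list)
    for p in people:
        blocks[blocking_key(p)].append(p)
    pairs = []
    for members in blocks.values():
        for a, b in combinations(members, 2):
            if a["tree_id"] != b["tree_id"]:
                pairs.append((a["id"], b["id"]))
    return pairs


people = [
    {"id": 1, "tree_id": 10, "surname": "Müller", "birth_year": 1884, "place": "Bremen"},
    {"id": 2, "tree_id": 20, "surname": "Muller", "birth_year": 1886, "place": "bremen"},
    {"id": 3, "tree_id": 20, "surname": "Muller", "birth_year": 1921, "place": "Bremen"},
]
```

Blocking only narrows the search; the ranking step would still score each surviving pair before any hint is raised.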
## 9. Authentication
- `AuthProvider` interface with implementations for **local** (password + email verification/reset), **OIDC** (validated against Authentik; expected to work with Keycloak, Auth0, etc.), and **social** (Google, Apple, Facebook).
- Operators enable any subset via config. This deployment will use Authentik (`auth.jpaul.io`) plus selected social providers; a bare self-hoster can run local-only.
- Sessions are backend-issued; the assistant principal is minted per-session and scoped to the acting user.
## 10. Search
- Postgres `pg_trgm` for fuzzy name matching (Mueller/Müller/Muller), plus date and place filters.
- All queries are filtered through the privacy engine; a non-owner can never surface a redacted living person via search.
- Place normalization uses the gazetteer's historical variants so a search hits regardless of which era's name was recorded.
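
To show why trigram matching tolerates spelling variants, here is a pure-Python approximation of the idea behind `pg_trgm`: pad each word, split it into three-character windows, and compare the sets. It is a teaching sketch, not a reimplementation of the extension, and its scores are not guaranteed to match Postgres in every case.

```python
import re


def trigrams(text: str) -> set[str]:
    """Three-character windows over each lowercased, space-padded word."""
    grams: set[str] = set()
    for word in re.findall(r"\w+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


score = similarity("Mueller", "Muller")
```

Here `score` works out to 0.5, comfortably above `pg_trgm`'s default similarity threshold of 0.3, while an unrelated surname shares no trigrams at all.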
## 11. Background jobs
Worker-handled, queue-driven:
- GEDCOM import (with mapping report) and export
- Media processing (thumbnails, EXIF stripping, checksums)
- Source-connector scraping
- Hint candidate generation + ranking
- Scheduled soft-delete purge (30-day window)
- Outbound email via SMTP
Jobs are idempotent and retryable; an external failure degrades gracefully rather than corrupting state.
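
One way to get both properties is to key each job and record its result only after success. The sketch below assumes an in-process result store and uses `ConnectionError` as a stand-in for any external failure; `JobRunner` and the key format are hypothetical.

```python
from typing import Callable


class JobRunner:
    """Runs each job at most once per idempotency key, retrying on failure."""

    def __init__(self, max_attempts: int = 3) -> None:
        self.max_attempts = max_attempts
        self.completed: dict[str, object] = {}

    def run(self, key: str, job: Callable[[], object]) -> object:
        if key in self.completed:  # duplicate delivery: reuse the stored result
            return self.completed[key]
        last_error = None
        for _ in range(self.max_attempts):
            try:
                result = job()
            except ConnectionError as exc:  # stand-in for an external failure
                last_error = exc
                continue
            self.completed[key] = result  # recorded only after success
            return result
        raise RuntimeError(f"job {key} failed after {self.max_attempts} attempts") from last_error


calls = {"n": 0}


def flaky_thumbnail() -> str:
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("object store unavailable")
    return "thumb.jpg"


runner = JobRunner()
first = runner.run("media:42:thumbnail", flaky_thumbnail)
second = runner.run("media:42:thumbnail", flaky_thumbnail)
```

The first call fails once and succeeds on retry; the second call is a duplicate delivery and never reaches the job body.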
## 12. Deployment & CI/CD
- **Images** are built by **Gitea Actions** on `git.jpaul.io` and pushed to the **Gitea container registry**.
- Servers **pull** new images to deploy — no build on the host.
- **Caddy** terminates TLS and reverse-proxies frontend + backend. **Cloudflare Tunnel** is the preferred ingress (no open inbound ports) but is never required; a plain Caddy-on-a-public-host deployment is equally supported.
- **Configuration** is entirely environment-driven (twelve-factor). One `.env` plus the compose file is enough to stand up a deployment.
- **Migrations** run on backend start (or via an explicit job) so an image pull + restart is a complete upgrade.
- **Backups:** documented procedure for Postgres dump + object-store sync; restore is the inverse.
**Repository layout (as scaffolded):**
```
/backend # FastAPI service, uv-managed (app/ = api, core; service/repository/domain land with the data model)
/deploy # docker-compose.yml, Caddyfile, .env.example
/.gitea/workflows # Gitea Actions: build images → Gitea registry
/frontend # Next.js (pending)
```
The compose stack runs `postgres` (pgvector image — includes `pgvector`; `pg_trgm` ships in contrib), `minio`, `backend`, and `caddy`. The **worker** container (same image as backend, worker mode) joins once queue-driven jobs exist. Phase 0 ships a minimal backend with `/health` (liveness) and `/health/ready` (Postgres reachability) to validate the deploy wiring before the data model lands.
## 13. Observability
- Structured (JSON) logs from backend and worker.
- Health/readiness endpoints for the proxy and orchestrator.
- Metrics endpoint (Prometheus-style) for queue depth, job outcomes, request latency, and model/connector call results.
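
Structured logging needs nothing beyond the standard library. In this sketch the `JsonFormatter` class, the `provenance.worker` logger name, and the `job_id` field are assumptions, and a `StringIO` stands in for stdout so the output can be inspected.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical extra field, attached via extra={"job_id": ...}.
        if hasattr(record, "job_id"):
            payload["job_id"] = record.job_id
        return json.dumps(payload)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

log = logging.getLogger("provenance.worker")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False  # keep the example's output confined to `stream`

log.info("job finished", extra={"job_id": "abc123"})
```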
## 14. Open architectural questions
Tracked alongside the PRD's open questions:
- **Queue backend default:** Postgres-based for zero-extra-service self-hosting vs. Redis for throughput — likely ship the Postgres option first, Redis as an opt-in.
- **Embeddings source:** hosted provider vs. local model for the hint engine; and whether early phases even need embeddings or can ship with fuzzy+rules first.
- **Geometry stack:** plain coordinates initially vs. PostGIS for real parcel geometry later (PostGIS is a bigger self-host ask, so it stays optional/deferred).
- **GEDCOM extension fidelity:** how aggressively to preserve vendor extension tags on round-trip.