justin 447daf7fa8 docs: bring all documentation current with shipped work
A multi-agent audit of every doc against the code surfaced ~50 stale/missing
items (the roadmap/status docs and the backlog had fallen behind the code).
This catches them up:

- CLAUDE.md: phase status was ~3 phases stale ("Phase 1 is next" while Phase 1 +
  chunks of 2 & 4 shipped). Rewrote the status list; added a model-provider
  tech-stack entry; updated repo-layout (integrations objectstore/models,
  deploy backup.sh/dev compose).
- ARCHITECTURE.md: §6 privacy engine described 3 visibility levels — corrected to
  the shipped 4 (adds site_members); documented per-tree AI policy on Tree,
  LLMProvider/EmbeddingProvider split + registry, ChangeProposal origin/status/
  operations, verified-email session gate, instance-owner role, schema-drift
  guard, and the env_file config model.
- PRD.md: 4-level visibility in US-040/§5.5, instance-owner role (§5.1/§5.11),
  per-tree AI policy (§5.8), §8 sequencing annotated with shipped status, header
  date/status bumped.
- README.md: 4-level privacy; softened "Full GEDCOM 7" to the 5.5.1/7 common
  subset; noted backups + instance-owner admin; moved property/land to an
  explicit "where it's headed" (no property models exist yet).
- BACKLOG.md: flipped ~15 shipped-but-open rows to Have (ChangeProposal, provider
  abstraction, GEDCOM citation export, membership management, operator backup,
  email-verification gate, per-tree AI policy, instance owner, the whole
  visibility/public-viewing/child-resource-redaction cluster #41-#51/#46), and
  reconciled the executive summary, "current defects" list, quick wins, and
  differentiators. Left genuinely-open items (citation/source redaction, sitemap,
  per-tree noindex, scoped-token API) accurately open.
- .env.example: dropped "SMTP wired in a later phase"; documented the worker
  purge knobs, S3_PRESIGN_TTL, COOKIE_NAME; removed a stray duplicate line.
- design/: tree-visibility.md and change-proposal.md marked Shipped; corrected
  the redaction approach (reuses member schemas, not a separate PublicPersonRead)
  and the apply() rollback claim (v1 is not cross-op transactional), and marked
  rate-limiting/sitemap/noindex as deferred.

No code changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Justin Paul <justin@jpaul.me>
2026-06-10 21:05:29 -04:00


Provenance — Architecture

Status: Draft v0.1
Companion to: docs/PRD.md
Last updated: 2026-06-06

This document describes how Provenance is built. The PRD describes what and why. Where the two disagree, the PRD wins on intent and this document is updated to match.


1. Guiding constraints

These come straight from the PRD and shape every decision below:

  1. Self-hostable and container-native. Everything runs in containers, configured by environment. A single compose stack stands up the whole system.
  2. Multi-tenant. Many users and many trees coexist on one deployment with hard data isolation between tenants.
  3. Privacy has no bypass. Every read path — UI, API, public link, AI assistant, search, hints — resolves visibility through one engine. There is no "admin shortcut" that returns raw rows.
  4. The AI assistant never writes autonomously. It is a scoped principal that can only propose changes; persistence requires explicit user approval.
  5. Sources are first-class. The schema makes a fact-without-a-source possible but a fact-with-a-source cheap and visible.
  6. Configurable integrations. Auth, mail, object storage, model providers, and scrapers are all pluggable so a third party could run their own deployment.

2. System topology

                         ┌─────────────────────────┐
        Internet  ─────► │  Cloudflare Tunnel (opt) │
                         └────────────┬────────────┘
                                      │
                              ┌───────▼────────┐
                              │     Caddy      │  TLS, reverse proxy
                              └───┬────────┬───┘
                                  │        │
                   /api, /assistant│        │  / (everything else)
                          ┌────────▼──┐  ┌──▼─────────────┐
                          │  Backend  │  │   Frontend     │
                          │  FastAPI  │  │   Next.js      │
                          └─┬───┬───┬─┘  └────────────────┘
                            │   │   │
              ┌─────────────┘   │   └──────────────┐
        ┌─────▼─────┐    ┌──────▼──────┐    ┌──────▼──────┐
        │ PostgreSQL│    │ S3-compatible│    │   Worker(s) │
        │           │    │   storage    │    │ (jobs/queue)│
        └───────────┘    │ (MinIO etc.) │    └──────┬──────┘
                         └──────────────┘           │
                                          ┌──────────▼──────────┐
                                          │ External services:  │
                                          │ model providers,    │
                                          │ scraper sources,    │
                                          │ SMTP                │
                                          └─────────────────────┘

Containers:

  • frontend — Next.js (React, TypeScript, Tailwind, shadcn/ui). Renders public trees via server components for SEO; client components for the editing experience.
  • backend — FastAPI (Python). The single source of truth for the domain, auth, the privacy engine, and the assistant's tool surface. Exposes an OpenAPI-described REST API.
  • worker — same image as backend, run in a worker mode. Handles async jobs: GEDCOM import/export, media processing, scraping, hint matching, scheduled purges.
  • postgres — relational core, JSONB attributes, pg_trgm fuzzy search, pgvector for match ranking.
  • objectstore — S3-compatible; MinIO for self-host, any S3 provider otherwise.
  • caddy — TLS + reverse proxy. Optional cloudflared sidecar for tunnel deployments.

A message queue (e.g., Redis or Postgres-backed) decouples the API from the worker. The choice is deliberately behind an interface so small deployments can use a Postgres-based queue and larger ones can swap in Redis.
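
As a rough illustration of what "behind an interface" can mean here, the sketch below defines a small protocol that any backend could satisfy. The `Job` shape, the `enqueue`/`dequeue` method names, and `InMemoryQueue` are assumptions made for this example, not the project's actual Queue adapter.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass(frozen=True)
class Job:
    """A unit of work handed from the API to the worker (illustrative shape)."""
    kind: str
    payload: dict[str, Any] = field(default_factory=dict)


class Queue(Protocol):
    """Hypothetical interface; the real adapter may expose different methods."""

    def enqueue(self, job: Job) -> None: ...
    def dequeue(self) -> Job | None: ...


class InMemoryQueue:
    """Stand-in backend for the sketch. A Postgres- or Redis-backed class
    would implement the same two methods."""

    def __init__(self) -> None:
        self._jobs: deque[Job] = deque()

    def enqueue(self, job: Job) -> None:
        self._jobs.append(job)

    def dequeue(self) -> Job | None:
        return self._jobs.popleft() if self._jobs else None


def request_gedcom_import(queue: Queue, tree_id: str) -> None:
    # The caller depends only on the interface, never on a concrete backend.
    queue.enqueue(Job(kind="gedcom_import", payload={"tree_id": tree_id}))
```

Because `request_gedcom_import` only sees the protocol, swapping the backend would be a wiring change rather than a change to the calling code.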

3. Backend architecture

Layered, dependency pointing inward:

  • API layer — FastAPI routers. Thin: validate input, resolve the caller, delegate to services, serialize output. Auto-generated OpenAPI spec is the contract the frontend and the assistant tool layer both consume.
  • Service layer — all domain logic and the only place writes happen. Enforces invariants (e.g., "a write must carry an actor for the audit log"). The privacy engine is invoked here on every read.
  • Repository layer — data access over SQLAlchemy; no business rules.
  • Domain models — the entities in §5.
  • Integrations — adapters behind interfaces: AuthProvider, ObjectStore, Mailer, LLMProvider / EmbeddingProvider (two separate model abstractions), SourceConnector, Queue. Swapping an implementation is a config change, not a code change.

Async throughout (FastAPI + async SQLAlchemy). Anything that can be slow or can fail externally (model calls, scraping, large imports) goes to the worker, never inline in a request.
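
The "writes happen only in the service layer, and every write carries an actor" invariant might look roughly like the following. This is a simplified, synchronous sketch: `PersonService`, `rename_person`, and the in-memory dictionaries standing in for the repository and audit log are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    actor: str
    action: str
    entity: str
    at: datetime


@dataclass
class PersonService:
    """Hypothetical service: the only place a write happens in this sketch."""
    people: dict[str, str] = field(default_factory=dict)        # stand-in repository
    audit_log: list[AuditEntry] = field(default_factory=list)   # stand-in audit table

    def rename_person(self, person_id: str, new_name: str, *, actor: str) -> None:
        # Invariant: refuse the write outright if nobody can be held accountable.
        if not actor:
            raise ValueError("a write must carry an actor for the audit log")
        self.people[person_id] = new_name
        self.audit_log.append(
            AuditEntry(actor, "rename", f"person:{person_id}", datetime.now(timezone.utc))
        )
```

An API router in this picture would only validate input and call `rename_person`; it would have no way to touch `people` without going through the check.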

4. Frontend architecture

  • Next.js App Router. Public, read-only tree/property views are server-rendered for SEO and speed; they call the backend through the same privacy-filtered API used everywhere else.
  • Authoring UI is client-side, talking to the REST API. A generated TypeScript client keeps it in sync with the OpenAPI contract.
  • Mobile-first. Layouts are responsive and touch-first; no separate mobile codebase. Feature parity with desktop is a requirement, not a nice-to-have.
  • Design system: Tailwind + shadcn/ui for a consistent, accessible (WCAG 2.2 AA target) component layer.
  • Status: the scaffold has landed — Next.js (App Router) + Tailwind + shadcn-style primitives, a typed client generated from the backend OpenAPI spec (openapi-typescript + openapi-fetch), and auth + tree/person views. Auth rides the same-origin HttpOnly session cookie (Caddy proxies /api/* to the backend). Built as a standalone container; Caddy routes / to it.

5. Data model

Core entities and the important relationships. (Illustrative, not final DDL.)

Tenancy & identity

  • User — a person with login. Auth method(s) are attached but identity is internal, so one user can link multiple providers.
  • Tree — the top-level tenant boundary for genealogical data. Owned by a User; may have additional members. Carries a per-tree AI model policy (owner-configured): ai_member_provider and ai_recommender_provider name configured providers from the model-provider registry (null = no model for that role); the owner may use any configured provider, while these cap what members and the recommender may use. Set via the owner-only GET/PATCH /trees/{id}/ai.
  • TreeMembership — (User, Tree, role) where role ∈ {owner, editor, viewer}. The basis for authorization within a tree.
  • Instance owner / operator — orthogonal to tree roles. The account(s) whose email is named in the OWNER_EMAIL env var and whose email is verified are the instance's operator(s), with access to the owner-only /api/v1/admin surface (operational status, instance-wide config). Derived from the env at request time — no DB column, no migration, can't drift, survives DB resets. The verified-email requirement is deliberate: registration is open, so without it whoever registers the owner address first would seize the role — verification ties ownership to proven control of the inbox. Crucially this is not a privacy bypass: an instance owner gets operational/config rights, not read access to other users' private trees or living-person PII — those still resolve only through the privacy engine. (is_instance_owner in api/deps.py.)
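
A minimal sketch of how an env-derived owner check could work is shown below. It is not the code in api/deps.py: the `User` shape, the comma-separated parsing of OWNER_EMAIL, and the case-insensitive comparison are assumptions made for the example.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    email: str
    email_verified: bool


def is_instance_owner(user: User, env: Mapping[str, str] | None = None) -> bool:
    """Sketch only. Assumes OWNER_EMAIL may hold a comma-separated list and
    that matching is case-insensitive; the real helper may differ."""
    source = os.environ if env is None else env
    owners = {
        entry.strip().lower()
        for entry in source.get("OWNER_EMAIL", "").split(",")
        if entry.strip()
    }
    # Both conditions are required: a matching address alone is not enough,
    # because anyone could register it before the real owner does.
    return user.email_verified and user.email.strip().lower() in owners
```

Nothing is persisted, so the answer is recomputed from the environment on each call, which matches the "no DB column, can't drift" property described above.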

Genealogical core

  • Person — belongs to a Tree. Has many Name records (with parts: given, surname, prefix/suffix, and a type such as birth/married/alias) to support variants and changes over time. Carries living/deceased status.
  • Relationship — typed edge between two Persons within a Tree: parent-child (with a qualifier: biological, adoptive, step, foster, donor, guardian), partnership/marriage (with its own events), sibling (often derived). Modeling parentage as qualified edges, rather than assuming a two-biological-parent nuclear family, is what makes adoption, donor conception, and blended families first-class rather than awkward.
  • Event — typed (birth, death, marriage, residence, immigration, etc.), with a date (supporting ranges, approximations, and non-Gregorian calendars), an optional Place, and attachable to a Person or a partnership.
  • Place — a tenant-shared gazetteer entity: hierarchical (place within place), with historical name variants and date ranges so a record entered as "Königsberg, 1900" sorts and displays correctly against "Kaliningrad." Optional coordinates. (Phase 0 scopes Place to a Tree via tree_id for absolute tenant isolation; a deployment-wide shared gazetteer is a deliberate later refinement. Variants live in a PlaceName child table.)
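
To make the "qualified edges" idea concrete, here is a small sketch of parentage stored as typed edges. The class and function names (`ParentQualifier`, `ParentChild`, `parents_of`) are illustrative assumptions, not the project's schema.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class ParentQualifier(Enum):
    BIOLOGICAL = "biological"
    ADOPTIVE = "adoptive"
    STEP = "step"
    FOSTER = "foster"
    DONOR = "donor"
    GUARDIAN = "guardian"


@dataclass(frozen=True)
class ParentChild:
    parent_id: str
    child_id: str
    qualifier: ParentQualifier


def parents_of(
    child_id: str,
    edges: list[ParentChild],
    qualifier: ParentQualifier | None = None,
) -> list[str]:
    """Return parent ids for a child, optionally filtered by qualifier.

    Nothing here limits a child to two parents, so adoptive and
    biological parents can coexist without special cases."""
    return [
        e.parent_id
        for e in edges
        if e.child_id == child_id and (qualifier is None or e.qualifier == qualifier)
    ]
```

With this shape, "who raised this person" and "who is this person biologically descended from" become two filters over the same edge list.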

Sources (first-class)

  • Source — a reusable record of an origin: title, repository, type, optional URL, free citation text, optional quality grade. One Source backs many facts.
  • Citation — the join between a Source and a specific fact (Event, Name, Relationship, OwnershipEvent), carrying locality (page, line, entry) and confidence. This is the spine of the whole product.
  • Media — a binary asset (image, scan, PDF, audio) stored in object storage; row holds metadata, checksum, and content type. Attachable to Person, Source, Event, or Property. EXIF location/PII can be stripped on upload.

Property & land

  • Property — belongs to a Tree. Has informal names ("the old McGregor place") and a legal description stored verbatim (metes-and-bounds prose or PLSS township/range/section), an optional geometry, and a lifespan. Designed so future parcels that split or merge can reference predecessors/successors.
  • OwnershipEvent — the property analogue of a life event: type (grant/patent, purchase, sale, inheritance, gift, tax sale, foreclosure, eminent domain), date, grantor Person(s), grantee Person(s), optional consideration, and a Citation. Chain-of-title is the ordered set of OwnershipEvents for a Property; "what did this person own" is the reverse traversal.
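
The two traversals described above (chain of title for a property, and holdings for a person) can be sketched over a flat list of events. The field names and both helper functions are assumptions for illustration; the sketch also treats "owned" as "ever appeared as a grantee" and does not net out later sales.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class OwnershipEvent:
    property_id: str
    kind: str                    # e.g. "patent", "purchase", "inheritance"
    on: date
    grantors: tuple[str, ...]    # Person ids
    grantees: tuple[str, ...]    # Person ids


def chain_of_title(property_id: str, events: list[OwnershipEvent]) -> list[OwnershipEvent]:
    """Forward traversal: one property's events in date order."""
    return sorted((e for e in events if e.property_id == property_id), key=lambda e: e.on)


def holdings_of(person_id: str, events: list[OwnershipEvent]) -> set[str]:
    """Reverse traversal: every property this person ever received."""
    return {e.property_id for e in events if person_id in e.grantees}
```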

Cross-cutting

  • AuditEntry — append-only: actor (User or the assistant principal acting for a User), action, entity, before/after snapshot, timestamp. Immutable.
  • SoftDelete — entities carry deleted_at; a scheduled worker purges rows whose deleted_at is more than 30 days old. Recovery means clearing deleted_at within that window.
  • ChangeProposal — a pending set of writes: records an origin (assistant | contributor — collaborator suggestions are encoded today, not just a future idea), a status (pending/applied/rejected), a structured operations diff (JSONB list of {op, entity_type, entity_id?, payload}), a summary/rationale, and review/apply-error metadata. The user approves, edits, or rejects; approved proposals are applied through the normal service layer (so they hit the privacy engine and audit log like any other write). Note: v1 apply is not cross-op transactional — see docs/design/change-proposal.md.
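
The operations list and the "not cross-op transactional" caveat can be illustrated with a small sketch. The dataclass fields mirror the description above, but `apply_proposal`, the `apply_op` callback, and the choice to leave a failed proposal in `pending` are assumptions; docs/design/change-proposal.md remains the authority on actual behavior.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ChangeProposal:
    origin: str                        # "assistant" | "contributor"
    operations: list[dict[str, Any]]   # [{op, entity_type, entity_id?, payload}]
    summary: str = ""
    status: str = "pending"            # "pending" | "applied" | "rejected"
    apply_error: str | None = None


def apply_proposal(
    proposal: ChangeProposal,
    apply_op: Callable[[dict[str, Any]], None],
) -> int:
    """Apply operations one at a time and return how many succeeded.

    Not cross-op transactional: operations applied before a failure stay
    applied. `apply_op` stands in for a call into the service layer."""
    if proposal.status != "pending":
        raise ValueError(f"cannot apply a proposal in status {proposal.status!r}")
    applied = 0
    for operation in proposal.operations:
        try:
            apply_op(operation)
        except Exception as exc:
            proposal.apply_error = f"operation {applied} failed: {exc!r}"
            return applied
        applied += 1
    proposal.status = "applied"
    return applied
```

A reviewer reading `apply_error` can then see which operation stopped the run and decide whether to fix the remainder by hand.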

6. Privacy engine

Conceptually, a single function:

visible(viewer, entity) -> { full | redacted | hidden }

Inputs: viewer's role on the entity's Tree (including "anonymous"), the Tree's visibility (public / site_members / unlisted / private), per-Person privacy override, and living-person status.

Rules:

  • Tree private → only members see anything.
  • Tree site_members → any authenticated account on this instance gets a read view (anonymous viewers get nothing), still per-person living/override filtered.
  • Tree unlisted / public → non-members including anonymous viewers get a read view, but every Person is run through the living-person check and per-person override first. Unlisted is gated only by knowing the link (never listed or search-indexed); public is listed in /explore and indexable.
  • Living-person rule — a Person with no death fact, whose birth is within a configurable recency window (default ~100 years; unknown birth treated as possibly-living), is redacted (name minimized, vitals/events/media hidden) for non-owners. Owners may override per Person.
  • The engine is invoked in the service layer, so it covers API, server-rendered public pages, search results, and any data the assistant can read. There is intentionally no path that returns rows without passing through it.
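
The rules above can be read as a short decision procedure. The sketch below is one possible encoding under stated assumptions: the `Viewer`, `Tree`, and `Person` shapes are simplified, the per-person override is modeled as a hypothetical `"show"`/`"redact"` string, unlisted-link possession is assumed to be checked by the caller, and unknown visibility values fail closed.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from enum import Enum


class Visibility(Enum):
    FULL = "full"
    REDACTED = "redacted"
    HIDDEN = "hidden"


@dataclass(frozen=True)
class Viewer:
    role: str | None        # "owner" | "editor" | "viewer"; None means non-member
    authenticated: bool


@dataclass(frozen=True)
class Tree:
    visibility: str         # "public" | "site_members" | "unlisted" | "private"


@dataclass(frozen=True)
class Person:
    birth_year: int | None
    has_death_fact: bool
    override: str | None = None   # hypothetical values: "show" | "redact"


def possibly_living(person: Person, today: date, window_years: int = 100) -> bool:
    if person.has_death_fact:
        return False
    if person.birth_year is None:
        return True                      # unknown birth: treat as possibly living
    return today.year - person.birth_year < window_years


def visible(viewer: Viewer, tree: Tree, person: Person, today: date) -> Visibility:
    if viewer.role is None:              # non-member: tree-level gate first
        if tree.visibility not in {"public", "unlisted", "site_members"}:
            return Visibility.HIDDEN     # private, or anything unrecognized
        if tree.visibility == "site_members" and not viewer.authenticated:
            return Visibility.HIDDEN
    if viewer.role == "owner":
        return Visibility.FULL
    if person.override == "show":
        return Visibility.FULL
    if person.override == "redact":
        return Visibility.REDACTED
    return Visibility.REDACTED if possibly_living(person, today) else Visibility.FULL
```

Following the wording of the living-person rule, this sketch redacts for every non-owner, including editors and viewers of the tree.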

7. AI research assistant

Three parts, deliberately separated:

  1. Model provider abstraction — two separate interfaces, LLMProvider and EmbeddingProvider (configured independently; e.g. Anthropic has no embeddings endpoint), over hosted models (Anthropic, OpenAI, xAI) and self-hosted/local models via an OpenAI-compatible endpoint or Ollama. An operator can configure several providers at once through a registry (build_llm_providers()/configured_llm_providers()), each selectable by name; this is the basis for the per-tree AI policy and the default_llm_provider/default_embedding_provider settings. Keys are supplied by the operator (this deployment) or by the user (BYO-key deployments).
  2. Scoped tool surface — the assistant can only act through a constrained set of tools that map to service-layer operations, scoped to the user it is helping. It is its own principal: it cannot exceed that user's rights, and every action is attributed to "assistant (on behalf of User X)" in the audit log. This is the MCP-style boundary referenced in the PRD — the assistant gets capabilities, not raw database access.
  3. Source connectors (SourceConnector) — a plugin framework for reading external data: FamilySearch API, Find A Grave, WikiTree, BLM/GLO land patents, USGS maps, public-domain newspapers, public county records. Only legally permissible sources ship with the project; operators can add their own. Connectors are read-only and rate-limited, and run in the worker.

Propose-then-confirm is structural, not a policy. The assistant's "write" tools don't write — they emit a ChangeProposal. The only thing that applies a proposal is a user action in the UI. There is no code path where a model response mutates the tree directly. This makes hallucinations a review annoyance, never a data-integrity event.

8. Hint & matching engine

  • Runs entirely in the worker, asynchronously, never in a user request.
  • Candidate generation: blocking on normalized name + key vitals + place to find plausible same-individual pairs across Trees.
  • Ranking: pgvector embeddings over a person's salient features reduce false positives beyond what string similarity alone gives.
  • Privacy-preserving by construction: the notification a user receives describes that a possible match exists, with zero identifying detail about the other user or any living person. Identities and data are exchanged only after mutual opt-in, after which the messaging channel opens. The matching threat model (what a malicious user could infer) is documented before this ships.
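
The two-stage shape (cheap blocking, then vector ranking) might be sketched as follows. Everything here is illustrative: the `Candidate` fields, the decade bucket used as the "key vitals" part of the block, and plain Python lists standing in for pgvector embeddings are assumptions.

```python
from __future__ import annotations

import math
import unicodedata
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Candidate:
    person_id: str
    tree_id: str
    surname: str
    birth_year: int
    place: str


def normalize(text: str) -> str:
    """Casefold and drop diacritics so 'Müller' and 'Muller' share a block."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold().strip()


def blocking_key(c: Candidate) -> tuple[str, int, str]:
    # Decade bucket is a simplification; real blocking would tolerate boundary years.
    return (normalize(c.surname), c.birth_year // 10, normalize(c.place))


def candidate_pairs(people: list[Candidate]) -> list[tuple[str, str]]:
    blocks: dict[tuple[str, int, str], list[Candidate]] = defaultdict(list)
    for person in people:
        blocks[blocking_key(person)].append(person)
    pairs = []
    for group in blocks.values():
        for a, b in combinations(group, 2):
            if a.tree_id != b.tree_id:   # hints are about matches across Trees
                pairs.append((a.person_id, b.person_id))
    return pairs


def cosine(u: list[float], v: list[float]) -> float:
    """Ranking score for a candidate pair's embeddings (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0
```

Blocking keeps the pair count manageable; the similarity score is only computed for pairs that survive it.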

9. Authentication

  • AuthProvider interface with implementations for local (password + email verification/reset), OIDC (validated against Authentik; expected to work with Keycloak, Auth0, etc.), and social (Google, Apple, Facebook).
  • Operators enable any subset via config. This deployment will use Authentik (auth.jpaul.io) plus selected social providers; a bare self-hoster can run local-only.
  • Sessions are backend-issued; the assistant principal is minted per-session and scoped to the acting user.
  • Status: local auth has landed — Argon2id password hashing, opaque backend-issued sessions (only the token hash is stored; presented as a Bearer token or HttpOnly cookie), and email verification + password reset via the Mailer interface (console in dev, SMTP for operators). An opt-in gate (REQUIRE_EMAIL_VERIFICATION, default off so SMTP-less self-hosts and pre-existing accounts aren't locked out) refuses sessions for accounts without a verified email — login is denied and existing sessions stop resolving until the address is verified. OIDC and social providers are Phase 5. Every write records an attributable actor in the audit log.
  • Instance owner / operator (orthogonal to the per-tree roles): the account(s) whose email is in OWNER_EMAIL and is verified are the instance operator(s), with the owner-only /api/v1/admin surface (operational status, instance-wide config). Derived from the env at request time — no DB column. It is an operator/config role, not a privacy bypass: it grants no read access to other users' private trees or living-person PII. (is_instance_owner in api/deps.py.)
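
The "only the token hash is stored" and "unverified accounts stop resolving" behaviors can be sketched together. This is a simplified in-memory model: `SessionStore`, `Account`, and the use of SHA-256 are assumptions for illustration (the document does not name the hash used for session tokens).

```python
from __future__ import annotations

import hashlib
import secrets
from dataclasses import dataclass


@dataclass
class Account:
    user_id: str
    email_verified: bool


def hash_token(token: str) -> str:
    # Assumption: SHA-256 of a high-entropy random token.
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


class SessionStore:
    def __init__(self, require_email_verification: bool = False) -> None:
        self.require_email_verification = require_email_verification
        self._by_hash: dict[str, Account] = {}   # token hash -> account

    def _allowed(self, account: Account) -> bool:
        return account.email_verified or not self.require_email_verification

    def login(self, account: Account) -> str | None:
        """Return an opaque token, or None when the gate denies login."""
        if not self._allowed(account):
            return None
        token = secrets.token_urlsafe(32)
        self._by_hash[hash_token(token)] = account   # the raw token is never stored
        return token

    def resolve(self, token: str) -> Account | None:
        """Look up a presented token; the gate is re-checked on every call."""
        account = self._by_hash.get(hash_token(token))
        if account is None or not self._allowed(account):
            return None
        return account
```

Re-checking the gate in `resolve` is what makes an existing session stop working as soon as the account no longer has a verified email.
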

10. Search

  • Postgres pg_trgm for fuzzy name matching (Mueller/Müller/Muller), plus date and place filters.
  • All queries are filtered through the privacy engine; a non-owner can never surface a redacted living person via search.
  • Place normalization uses the gazetteer's historical variants so a search hits regardless of which era's name was recorded.
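
For intuition about why trigram matching groups spellings such as Mueller and Muller, the following pure-Python approximation mimics the general idea behind pg_trgm: pad each lowercased word, collect three-character windows, and compare the sets. It is a teaching sketch, not a reimplementation of the extension, and in production the comparison would run inside Postgres.

```python
import re


def trigrams(text: str) -> set[str]:
    """Three-character windows over each word, padded with two leading
    spaces and one trailing space."""
    grams: set[str] = set()
    for word in re.findall(r"\w+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by total distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def fuzzy_matches(query: str, names: list[str], threshold: float = 0.3) -> list[str]:
    # 0.3 mirrors pg_trgm's default similarity threshold.
    return [name for name in names if similarity(query, name) >= threshold]
```

Spelling variants share most of their windows, so they score well above unrelated surnames, which typically share none.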

11. Background jobs

Worker-handled, queue-driven:

  • GEDCOM import (with mapping report) and export
  • Media processing (thumbnails, EXIF stripping, checksums)
  • Source-connector scraping
  • Hint candidate generation + ranking
  • Scheduled soft-delete purge (30-day window)
  • Outbound email via SMTP

Jobs are idempotent and retryable; an external failure degrades gracefully rather than corrupting state.
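
One common way to get "idempotent and retryable" is to key each job and skip work that has already completed. The sketch below shows that pattern in miniature; `run_job`, the `completed` set (standing in for a durable table), and the fixed attempt count are assumptions, and backoff between attempts is omitted for brevity.

```python
from typing import Callable


def run_job(
    job_id: str,
    handler: Callable[[], None],
    completed: set[str],
    max_attempts: int = 3,
) -> bool:
    """Run `handler` at most `max_attempts` times; return True once it has
    succeeded (now or on an earlier delivery of the same job)."""
    if job_id in completed:
        return True                  # duplicate delivery: safe no-op
    for _ in range(max_attempts):
        try:
            handler()
        except Exception:
            continue                 # external failure: try again
        completed.add(job_id)        # record success only after the work is done
        return True
    return False                     # give up; nothing was marked complete
```

Because success is recorded only after the handler returns, a crash mid-job leaves the job eligible for another attempt rather than silently lost.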

12. Deployment & CI/CD

  • Images are built by Gitea Actions (runs-on: docker) and pushed to the Gitea container registry, one package per component (provenance-backend, provenance-frontend) linked to the repo.
  • Split push/pull endpoints (mirrors the drawbar setup): CI pushes to the LAN registry endpoint 192.168.0.2:1234 over plain HTTP (buildx configured insecure/http) to bypass the Cloudflare request-body limit; servers pull from the public git.jpaul.io FQDN (TLS via Cloudflare). Same Gitea registry, two front doors. Auth uses the REGISTRY_TOKEN Actions secret.
  • Tag scheme: test-main (current main), test-sha-<long> (rollback pins), the component version, and latest on v* tags.
  • Servers pull new images to deploy — no build on the host. The deploy compose references git.jpaul.io/justin/provenance-{backend,frontend}:${IMAGE_TAG:-test-main}; docker-compose.dev.yml is a local-build override.
  • Caddy terminates TLS and reverse-proxies frontend + backend. Cloudflare Tunnel is the preferred ingress (no open inbound ports) but is never required; a plain Caddy-on-a-public-host deployment is equally supported.
  • Configuration is entirely environment-driven (twelve-factor). One .env plus the compose file is enough to stand up a deployment; the backend/worker/migrate services read it via env_file, so every setting in app/core/config.py is configurable without a compose edit.
  • Migrations run on backend start (RUN_MIGRATIONS=1) and via a one-shot migrate compose service, so an image pull + restart is a complete upgrade. A schema-drift guard (defense in depth) makes a half-applied deploy loud rather than a silent storm of 500s: /health/ready returns 503 and startup logs a CRITICAL SCHEMA DRIFT line when the DB's alembic_version is behind the heads baked into the image (app/core/schema_version.py).
  • Backups: a one-command operator script (deploy/backup.sh: pg_dump + MinIO object sync; see deploy/BACKUP.md) plus a per-account ZIP export; restore is the inverse.
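
The schema-drift guard described above boils down to comparing what the database reports with what the image expects. The sketch below is a simplification and not the code in app/core/schema_version.py: the `readiness` function and its return shape are hypothetical, and it flags any mismatch between the two revision sets rather than determining whether the database is specifically behind.

```python
import logging

logger = logging.getLogger("schema_drift_sketch")


def readiness(
    db_reachable: bool,
    db_revisions: set[str],
    image_heads: set[str],
) -> tuple[int, dict[str, str]]:
    """Return an HTTP status code and body for a readiness probe."""
    if not db_reachable:
        return 503, {"status": "unavailable", "reason": "database unreachable"}
    if db_revisions != image_heads:
        # Make the half-applied deploy loud instead of failing request by request.
        logger.critical(
            "SCHEMA DRIFT: database at %s, image expects %s",
            sorted(db_revisions),
            sorted(image_heads),
        )
        return 503, {"status": "unavailable", "reason": "schema drift"}
    return 200, {"status": "ready"}
```

Returning 503 from the readiness probe lets the proxy or orchestrator hold traffic until migrations have caught up.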

Repository layout (as scaffolded):

/backend           # FastAPI, uv-managed. app/{api/v1, services (+privacy), repositories, models, schemas, integrations (auth, mailer, objectstore, models = LLM/embedding providers), core}; migrations/ = Alembic
/deploy            # docker-compose.yml (+ docker-compose.dev.yml), Caddyfile, .env.example, backup.sh + BACKUP.md
/.gitea/workflows  # Gitea Actions: build images → Gitea registry
/frontend          # Next.js (App Router, TS, Tailwind). app/ pages, lib/api (openapi-typescript client), components/ui, Dockerfile (standalone)

The compose stack runs postgres (pgvector image — includes pgvector; pg_trgm ships in contrib), minio, a one-shot migrate job, backend, the worker (same image as backend, worker mode — runs the scheduled soft-delete purge), caddy, and an optional cloudflared tunnel. The backend exposes /health (liveness) and /health/ready (Postgres reachability + schema-drift check).

13. Observability

  • Structured (JSON) logs from backend and worker.
  • Health/readiness endpoints for the proxy and orchestrator.
  • Metrics endpoint (Prometheus-style) for queue depth, job outcomes, request latency, and model/connector call results.

14. Open architectural questions

Tracked alongside the PRD's open questions:

  • Queue backend default: Postgres-based for zero-extra-service self-hosting vs. Redis for throughput — likely ship the Postgres option first, Redis as an opt-in.
  • Embeddings source: hosted provider vs. local model for the hint engine; and whether early phases even need embeddings or can ship with fuzzy+rules first.
  • Geometry stack: plain coordinates initially vs. PostGIS for real parcel geometry later (PostGIS is a bigger self-host ask, so it stays optional/deferred).
  • GEDCOM extension fidelity: how aggressively to preserve vendor extension tags on round-trip.