# CLAUDE.md

Operating guide for Claude Code (and any AI assistant) working in this repository. Read this first, then [docs/PRD.md](docs/PRD.md) and [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md).

## What this project is

**Provenance** is self-hostable, source-available software for tracing where you come from — **family *and* land**. It combines a genealogy application (people, relationships, events, sources, media) with **property chain-of-title** tracking (parcels, deeds, ownership events), a privacy model, an AI research assistant, and a cross-tree hint system. It is multi-tenant and container-native.

The name is the thesis: *provenance* means a documented chain of custody. Every fact should link to where it came from.

## Non-negotiable rules

These are product invariants, not preferences. Do not violate them, and flag any task that seems to require it:

1. **The AI assistant never writes autonomously.** Assistant "write" operations emit a `ChangeProposal` (a structured diff) that a human approves, edits, or rejects in the UI. There must be no code path where a model response mutates tree data directly. This is structural — enforce it in the type system / service boundaries, not just by convention.
2. **Privacy has a single enforcement point.** All reads — API, server-rendered public pages, search, hints, assistant — resolve visibility through one privacy engine in the service layer. Never add a query path that returns rows without passing through it.
3. **Living people are protected by default.** Non-owners do not see PII for a person who is (or may be) living. See the living-person rule in ARCHITECTURE §6.
4. **Hint matching is anonymous until mutual consent.** A match notification must reveal nothing identifying about the other user or any living person. Identities exchange only after both sides opt in.
5. **Sources are first-class.** Don't model citations as free-text afterthoughts. A `Source` is a reusable entity; a `Citation` links it to a specific fact.
6. **Only legal data sources.** Ship scrapers/connectors only for permissible sources (FamilySearch API, Find A Grave, WikiTree, BLM/GLO, USGS, public-domain newspapers, public county records). Never add connectors for paywalled/terms-prohibited sites (Ancestry, MyHeritage, 23andMe).
7. **Everything is configurable via environment.** Auth, mail, object storage, database, model providers, scrapers — all twelve-factor. No hard-coded endpoints or keys.
8. **Full CRUD on every object.** Every stored entity (person, name, event, relationship, source, citation, media, tree, …) must support create, read, **update**, and delete — in the API *and* the UI. Historical research is constant correction and new information, so nothing is write-once. Any new feature or data type ships with all four operations; an entity you can create but not edit is a bug.

## Tech stack

- **Frontend:** Next.js (App Router) + React + TypeScript + Tailwind + shadcn/ui. Mobile-first, server components for public/SEO pages, generated TS client from the backend OpenAPI spec.
- **Backend:** Python + FastAPI, async, layered (API → service → repository → domain). SQLAlchemy. OpenAPI is the contract.
- **Worker:** same image as backend in worker mode; queue-driven async jobs.
- **Database:** PostgreSQL with `pg_trgm` (fuzzy search) and `pgvector` (match ranking).
- **Object storage:** S3-compatible (MinIO for self-host).
- **Edge:** Caddy reverse proxy; optional Cloudflare Tunnel (preferred ingress, never required).
- **Email:** operator-configured SMTP.
- **Model providers:** pluggable `LLMProvider` + `EmbeddingProvider` abstraction (ABCs) with Null / Anthropic / OpenAI-compatible (OpenAI, xAI, Ollama) implementations; an operator configures one or more via env and they're selectable by name through a registry (per-tree AI policy + `default_llm_provider`/`default_embedding_provider`).
- **CI/CD:** Gitea Actions build per-component images. **Push** to the LAN registry `192.168.0.2:1234` (plain HTTP, bypasses Cloudflare's body limit); **pull** via the public `git.jpaul.io` FQDN. Servers pull to deploy — no host build. Mirrors the drawbar setup; see [[gitea-lan-push-fqdn-pull]].

Pick libraries consistent with this stack. If you introduce a significant dependency or a new service, note it in ARCHITECTURE.md in the same change.

## Repository layout

```
/                  # docs and project meta (this file, README, LICENSE, COC, CONTRIBUTING)
/docs              # PRD.md, ARCHITECTURE.md
/backend           # FastAPI service (uv-managed). app/{api/v1, services (+ privacy engine), repositories, models, schemas, integrations (auth, mailer, objectstore, models = pluggable LLM/embedding providers), core}; migrations/ = Alembic
/deploy            # docker-compose.yml (+ docker-compose.dev.yml), Caddyfile, .env.example, backup.sh + BACKUP.md (one-command pg_dump + MinIO backup) — the self-host stack
/.gitea/workflows  # Gitea Actions CI (build images → Gitea registry)
/frontend          # Next.js (App Router, TS, Tailwind, shadcn-style UI). app/ pages, lib/api generated OpenAPI client, components/ui
```

Phase 0 landed **deploy-first**: the compose stack (Postgres + MinIO + Caddy + FastAPI backend) and CI before the data model and frontend. Backend deps use **uv**; migrations use **Alembic**.

Status (keep current as the tree grows):

- **Phase 0 — Foundation: complete** and running live (core data model, local auth behind `AuthProvider`, Next.js frontend).
- **Phase 1 — Core tree: complete.** Media (upload/serve), soft-delete + recovery UI, full CRUD across entities, and the 4-level tree visibility/privacy model (#41–#51).
- **Phase 2 — substantially landed.** GEDCOM import (preview→apply, duplicate-aware) and export (citation-preserving, #232); fuzzy name search (pg_trgm) + the public `/explore` directory. Living-person protection is still hardening.
- **Phase 4 — AI assistant foundations landed.** Pluggable `LLMProvider`/`EmbeddingProvider` abstraction + multi-provider registry (Anthropic/OpenAI/xAI/Ollama, #235/#237), the **ChangeProposal** propose-then-confirm flow (#236), and per-tree AI model policy (#238). The assistant's *tool surface that emits proposals* is the remaining piece.
- Also shipped: tree membership management (#233), an **instance owner/operator** role (`OWNER_EMAIL`, #240), a schema-drift readiness guard (#239), and a one-command operator backup (#234).
- **Not built yet:** Phase 3 (Property — parcels/deeds/chain-of-title; no property models exist), Phase 5 (OIDC/social auth — only the `AuthProvider` ABC exists), and cross-tree hints (last; needs multiple populated trees + the embedding provider).

## Where to start

The roadmap is phased in PRD §8. Build in dependency order. **Phases 0 and 1 are complete**, Phase 2 is substantially done, and Phase 4's AI foundations have shipped (see the status list above). The biggest unbuilt areas are **Phase 3 (Property)** and **Phase 5 (OIDC/social auth)** — likely current targets. For reference, Phase 0 covered:

1. Backend skeleton (FastAPI, async, layered) + Postgres + migrations
2. Core data model from ARCHITECTURE §5 — start with User, Tree, TreeMembership, Person, Name, Relationship, Event, Place, Source, Citation, AuditEntry, soft-delete support
3. Local auth (password + email verification) behind the `AuthProvider` interface
4. Frontend scaffold (Next.js) wired to the API via the generated client
5. The deploy stack: `compose` for app + postgres + objectstore, Caddy config, env-driven settings
6. CI/CD: Gitea Actions building images to the registry

Don't get ahead of the phases. GEDCOM and the assistant's propose-diff foundation (provider abstraction + ChangeProposal approval flow) have shipped; the remaining dependency-ordered work is **Property** (Phase 3, on top of the tested people graph), then richer collaboration/audit UI, with **cross-tree hints last** (they need multiple populated trees and the embedding provider). If you think the order is wrong, raise it rather than reordering silently.

## Conventions

- **Sign off every commit with the DCO.** Use `git commit -s`. Commits without a `Signed-off-by` line cannot be merged. See [CONTRIBUTING.md](CONTRIBUTING.md).
- **Commit messages:** concise summary line; body explaining *why* when it isn't obvious. One logical change per commit where practical.
- **Tests** accompany new behavior once a test surface exists.
- **Docs travel with code:** update PRD/ARCHITECTURE in the same change when scope or design shifts.
- **Privacy/assistant/hint code gets extra care** — these are the areas where bugs do real harm. Prefer a design note before a large change.
- **No secrets in the repo.** Config via env; provide `.env.example` with placeholders.

## Patched dependencies (family-chart)

The tree view uses **family-chart** (d3-based). Two adjustments live in the repo:

- **CSS is vendored** at `frontend/app/trees/[id]/tree/chart.css` — the package blocks its CSS subpath export, so we copy it in.
- **The library is patched** via `patch-package` (`frontend/patches/family-chart+0.9.0.patch`, applied by the `postinstall` hook; the backend/frontend Dockerfiles `COPY patches` before install). Both hunks touch `dist/family-chart.js` **and** `dist/family-chart.esm.js` (the app loads the `esm` build). Current fixes:
  1. **Spouse-centering layout** (`setupSpouses` / `sortChildrenWithSpouses`) — center a person between two spouses with children under the correct pair.
  2. **`cardToMiddle` vertical centering** — the lib scaled `datum.x` by the zoom factor `k` but not `datum.y`, so "fly to a node" drifted vertically at any zoom ≠ 1; we add the missing `* k`.

To change a patch: edit the file(s) under `node_modules/family-chart/dist/`, then `cd frontend && npx patch-package family-chart` to regenerate, and verify with `npx patch-package --error-on-fail`.

**Upstream these.** Both are general library bugfixes, not app-specific. The `cardToMiddle` fix is submitted — **donatso/family-chart#103** (issue **#102**). The spouse-layout fix still needs upstreaming; do it when there's time. When a fixed release ships, drop the corresponding patch hunk **and** remove any in-app compensation (e.g. the `cardToMiddle` caller in `tree/page.tsx` passes raw `y` precisely because the patch fixes it — pre-scaling there too would double-correct).

## License & contribution terms

Provenance is **source-available** under **BUSL-1.1** (see [LICENSE](LICENSE)): free for personal/family/non-commercial use, no third-party commercial hosting, and each release converts to **AGPL-3.0** four years after it ships. The DCO sign-off keeps the licensing chain clean so the maintainer can manage that conversion and a possible future hosted offering. Don't add code under an incompatible license, and don't vendor dependencies whose licenses conflict with eventual AGPL distribution.

## Brand

Visual identity lives in [docs/brand/](docs/brand/) (see its README for full guidance). Use these as the frontend's design tokens:

- **Ink** (primary text/marks): `#1A1A17` light / `#F2EEE6` dark
- **Bronze** (accent, constant): `#A06A42`
- **Paper** (knockout on bronze, constant): `#F7F3EC`
- **Muted** (secondary text): `#6B6862` light / `#9A968E` dark

Wordmark is a serif (heritage register); UI body/secondary text is a humanist sans. Logo lockup: `docs/brand/provenance-logo.svg`; app icon/favicon: `docs/brand/provenance-icon.svg` and `favicon.svg`. Don't recolor outside the palette or add gradients/shadows — the look is flat and warm.

## Owner & contact

Maintainer: **Justin Paul** (`justin@jpaul.io`). This deployment targets a home lab: Authentik at `auth.jpaul.io` for auth, `mail.jpaul.io` for SMTP, behind Caddy + Cloudflare Tunnel.

## Open questions (don't assume answers)

Parked in PRD §11 and ARCHITECTURE §14: telemetry (opt-in anonymous vs none), embeddings provider for matching, DNA as future-phase vs permanent non-goal, native mobile timing, hosted-SaaS model, queue backend default (Postgres vs Redis), and PostGIS adoption. If a task depends on one of these, surface the dependency instead of picking silently.
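
## Example: propose-then-confirm (illustrative)

Rule 1 under "Non-negotiable rules" asks that the assistant only ever emit a `ChangeProposal` and that the write path require a human decision. The sketch below shows one way that boundary might look in Python. It is not the repository's implementation: the field layout of `ChangeProposal`, and the names `FieldChange`, `HumanApproval`, `TreeStore`, `assistant_propose`, and `apply_proposal`, are all assumptions made for illustration. The idea is that the assistant-facing function is never handed a store, so it can only return a proposal, and the apply path refuses to run without an approval object.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Any


class ProposalStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"


@dataclass(frozen=True)
class FieldChange:
    """One field-level diff entry (hypothetical shape)."""
    entity_id: str
    field_name: str
    old: Any
    new: Any


@dataclass
class ChangeProposal:
    """What the assistant emits instead of writing (fields are assumed)."""
    tree_id: str
    changes: tuple[FieldChange, ...]
    status: ProposalStatus = ProposalStatus.PENDING


@dataclass(frozen=True)
class HumanApproval:
    """Hypothetical token, built only by the API layer for a signed-in reviewer."""
    reviewer_id: str


class TreeStore:
    """Toy in-memory stand-in for the repository layer."""

    def __init__(self) -> None:
        self._rows: dict[str, dict[str, Any]] = {}

    def get(self, entity_id: str) -> dict[str, Any]:
        return dict(self._rows.get(entity_id, {}))

    def write(self, change: FieldChange) -> None:
        self._rows.setdefault(change.entity_id, {})[change.field_name] = change.new


def assistant_propose(
    tree_id: str, entity_id: str, field_name: str, old: Any, new: Any
) -> ChangeProposal:
    """Assistant-facing tool: receives no store, so it cannot mutate tree data."""
    return ChangeProposal(tree_id, (FieldChange(entity_id, field_name, old, new),))


def apply_proposal(
    store: TreeStore, proposal: ChangeProposal, approval: HumanApproval
) -> None:
    """Service-layer write path: runs only with a human approval in hand."""
    if not isinstance(approval, HumanApproval):
        raise PermissionError("a human approval is required to apply a proposal")
    if proposal.status is not ProposalStatus.PENDING:
        raise ValueError("proposal has already been decided")
    for change in proposal.changes:
        store.write(change)
    proposal.status = ProposalStatus.APPROVED
```

In this sketch a call such as `assistant_propose("t1", "person-1", "birth_year", None, 1902)` leaves the store untouched until `apply_proposal(store, proposal, HumanApproval("user-9"))` runs. Python cannot make the boundary airtight on its own, so a real design would likely also keep the store out of the assistant module's imports and check the reviewer's tree permissions before applying.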