justin 447daf7fa8 docs: bring all documentation current with shipped work

A multi-agent audit of every doc against the code surfaced ~50 stale/missing
items (the roadmap/status docs and the backlog had fallen behind the code).
This catches them up:

- CLAUDE.md: phase status was ~3 phases stale ("Phase 1 is next" while Phase 1 +
  chunks of 2 & 4 shipped). Rewrote the status list; added a model-provider
  tech-stack entry; updated repo-layout (integrations objectstore/models,
  deploy backup.sh/dev compose).
- ARCHITECTURE.md: §6 privacy engine described 3 visibility levels — corrected to
  the shipped 4 (adds site_members); documented per-tree AI policy on Tree,
  LLMProvider/EmbeddingProvider split + registry, ChangeProposal origin/status/
  operations, verified-email session gate, instance-owner role, schema-drift
  guard, and the env_file config model.
- PRD.md: 4-level visibility in US-040/§5.5, instance-owner role (§5.1/§5.11),
  per-tree AI policy (§5.8), §8 sequencing annotated with shipped status, header
  date/status bumped.
- README.md: 4-level privacy; softened "Full GEDCOM 7" to the 5.5.1/7 common
  subset; noted backups + instance-owner admin; moved property/land to an
  explicit "where it's headed" (no property models exist yet).
- BACKLOG.md: flipped ~15 shipped-but-open rows to Have (ChangeProposal, provider
  abstraction, GEDCOM citation export, membership management, operator backup,
  email-verification gate, per-tree AI policy, instance owner, the whole
  visibility/public-viewing/child-resource-redaction cluster #41-#51/#46), and
  reconciled the executive summary, "current defects" list, quick wins, and
  differentiators. Left genuinely-open items (citation/source redaction, sitemap,
  per-tree noindex, scoped-token API) accurately open.
- .env.example: dropped "SMTP wired in a later phase"; documented the worker
  purge knobs, S3_PRESIGN_TTL, COOKIE_NAME; removed a stray duplicate line.
- design/: tree-visibility.md and change-proposal.md marked Shipped; corrected
  the redaction approach (reuses member schemas, not a separate PublicPersonRead)
  and the apply() rollback claim (v1 is not cross-op transactional), and marked
  rate-limiting/sitemap/noindex as deferred.

No code changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Justin Paul <justin@jpaul.me>

2026-06-10 21:05:29 -04:00

10 KiB

CLAUDE.md

Operating guide for Claude Code (and any AI assistant) working in this repository. Read this first, then docs/PRD.md and docs/ARCHITECTURE.md.

What this project is

Provenance is self-hostable, source-available software for tracing where you come from — family and land. It combines a genealogy application (people, relationships, events, sources, media) with property chain-of-title tracking (parcels, deeds, ownership events), a privacy model, an AI research assistant, and a cross-tree hint system. It is multi-tenant and container-native.

The name is the thesis: provenance means a documented chain of custody. Every fact should link to where it came from.

Non-negotiable rules

These are product invariants, not preferences. Do not violate them, and flag any task that seems to require it:

The AI assistant never writes autonomously. Assistant "write" operations emit a ChangeProposal (a structured diff) that a human approves, edits, or rejects in the UI. There must be no code path where a model response mutates tree data directly. This is structural — enforce it in the type system / service boundaries, not just by convention.
Privacy has a single enforcement point. All reads — API, server-rendered public pages, search, hints, assistant — resolve visibility through one privacy engine in the service layer. Never add a query path that returns rows without passing through it.
Living people are protected by default. Non-owners do not see PII for a person who is (or may be) living. See the living-person rule in ARCHITECTURE §6.
Hint matching is anonymous until mutual consent. A match notification must reveal nothing identifying about the other user or any living person. Identities exchange only after both sides opt in.
Sources are first-class. Don't model citations as free-text afterthoughts. A Source is a reusable entity; a Citation links it to a specific fact.
Only legal data sources. Ship scrapers/connectors only for permissible sources (FamilySearch API, Find A Grave, WikiTree, BLM/GLO, USGS, public-domain newspapers, public county records). Never add connectors for paywalled/terms-prohibited sites (Ancestry, MyHeritage, 23andMe).
Everything is configurable via environment. Auth, mail, object storage, database, model providers, scrapers — all twelve-factor. No hard-coded endpoints or keys.
Full CRUD on every object. Every stored entity (person, name, event, relationship, source, citation, media, tree, …) must support create, read, update, and delete — in the API and the UI. Historical research is constant correction and new information, so nothing is write-once. Any new feature or data type ships with all four operations; an entity you can create but not edit is a bug.
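One way to picture the first rule as a service boundary is sketched below. This is a minimal illustration, not the shipped code: ChangeProposal and its origin/status/operations fields are named in the project docs, but Operation's shape, AssistantToolSurface, TreeService, and approve_and_apply are hypothetical names chosen for the example. The point is that the object handed to the model can only propose, and the only mutating method requires a human reviewer.

```python
from dataclasses import dataclass
from enum import Enum


class ProposalStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass(frozen=True)
class Operation:
    """One step of the structured diff (hypothetical shape)."""
    entity: str        # e.g. "person"
    entity_id: str
    attribute: str     # which attribute the step would change
    new_value: str


@dataclass
class ChangeProposal:
    origin: str                        # e.g. "assistant"
    operations: tuple[Operation, ...]
    status: ProposalStatus = ProposalStatus.PENDING


class AssistantToolSurface:
    """Everything the assistant can reach: it can propose, never apply."""

    def __init__(self) -> None:
        self.queue: list[ChangeProposal] = []

    def propose(self, operations: list[Operation]) -> ChangeProposal:
        proposal = ChangeProposal(origin="assistant", operations=tuple(operations))
        self.queue.append(proposal)
        return proposal


class TreeService:
    """Human-facing service (hypothetical); the only place data is mutated."""

    def __init__(self) -> None:
        self.people: dict[str, dict[str, str]] = {}

    def approve_and_apply(self, proposal: ChangeProposal, reviewer_id: str) -> None:
        if not reviewer_id:
            raise PermissionError("a human reviewer is required")
        if proposal.status is not ProposalStatus.PENDING:
            raise ValueError("proposal already resolved")
        # Apply each operation in order; this toy version has no rollback.
        for op in proposal.operations:
            self.people.setdefault(op.entity_id, {})[op.attribute] = op.new_value
        proposal.status = ProposalStatus.APPROVED
```

Because AssistantToolSurface holds no reference to TreeService, a model response has no path to tree data except through a pending proposal that a person later approves.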

Tech stack

Frontend: Next.js (App Router) + React + TypeScript + Tailwind + shadcn/ui. Mobile-first, server components for public/SEO pages, generated TS client from the backend OpenAPI spec.
Backend: Python + FastAPI, async, layered (API → service → repository → domain). SQLAlchemy. OpenAPI is the contract.
Worker: same image as backend in worker mode; queue-driven async jobs.
Database: PostgreSQL with pg_trgm (fuzzy search) and pgvector (match ranking).
Object storage: S3-compatible (MinIO for self-host).
Edge: Caddy reverse proxy; optional Cloudflare Tunnel (preferred ingress, never required).
Email: operator-configured SMTP.
Model providers: pluggable LLMProvider + EmbeddingProvider abstraction (ABCs) with Null / Anthropic / OpenAI-compatible (OpenAI, xAI, Ollama) implementations; an operator configures one or more via env and they're selectable by name through a registry (per-tree AI policy + default_llm_provider/default_embedding_provider).
CI/CD: Gitea Actions build per-component images. Push to the LAN registry 192.168.0.2:1234 (plain HTTP, bypasses Cloudflare's body limit); pull via the public git.jpaul.io FQDN. Servers pull to deploy — no host build. Mirrors the drawbar setup; see gitea-lan-push-fqdn-pull.

Pick libraries consistent with this stack. If you introduce a significant dependency or a new service, note it in ARCHITECTURE.md in the same change.
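The model-provider entry above describes providers that are selectable by name through a registry, with a per-tree policy and an instance default. A rough sketch of that lookup follows. LLMProvider and default_llm_provider are names from this guide; the complete() method, the ProviderRegistry class, its register/resolve methods, and the EchoLLMProvider toy are assumptions made so the example runs without network access.

```python
from abc import ABC, abstractmethod
from typing import Optional


class LLMProvider(ABC):
    """Interface name is from this guide; complete() is an assumed method."""

    @abstractmethod
    def complete(self, prompt: str) -> str: ...


class NullLLMProvider(LLMProvider):
    """Stand-in used when no model is configured; returns an empty reply."""

    def complete(self, prompt: str) -> str:
        return ""


class EchoLLMProvider(LLMProvider):
    """Toy provider for the example only."""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


class ProviderRegistry:
    """Hypothetical registry: providers are registered and resolved by name."""

    def __init__(self, default_llm_provider: str) -> None:
        self._llms: dict[str, LLMProvider] = {}
        self.default_llm_provider = default_llm_provider

    def register(self, name: str, provider: LLMProvider) -> None:
        self._llms[name] = provider

    def resolve(self, tree_policy_provider: Optional[str] = None) -> LLMProvider:
        # A per-tree policy, when set, wins over the instance default.
        name = tree_policy_provider or self.default_llm_provider
        try:
            return self._llms[name]
        except KeyError:
            raise LookupError(f"no LLM provider registered as {name!r}") from None


registry = ProviderRegistry(default_llm_provider="null")
registry.register("null", NullLLMProvider())
registry.register("echo", EchoLLMProvider())
```

Failing loudly on an unknown name, rather than silently falling back, keeps a misconfigured tree policy visible to the operator; whether the real registry behaves this way is not stated here.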

Repository layout

/                  # docs and project meta (this file, README, LICENSE, COC, CONTRIBUTING)
/docs              # PRD.md, ARCHITECTURE.md
/backend           # FastAPI service (uv-managed). app/{api/v1, services (+ privacy engine), repositories, models, schemas, integrations (auth, mailer, objectstore, models = pluggable LLM/embedding providers), core}; migrations/ = Alembic
/deploy            # docker-compose.yml (+ docker-compose.dev.yml), Caddyfile, .env.example, backup.sh + BACKUP.md (one-command pg_dump + MinIO backup) — the self-host stack
/.gitea/workflows  # Gitea Actions CI (build images → Gitea registry)
/frontend          # Next.js (App Router, TS, Tailwind, shadcn-style UI). app/ pages, lib/api generated OpenAPI client, components/ui

Phase 0 landed deploy-first: the compose stack (Postgres + MinIO + Caddy + FastAPI backend) and CI before the data model and frontend. Backend deps use uv; migrations use Alembic. Status (keep current as the tree grows):

Phase 0 — Foundation: complete and running live (core data model, local auth behind AuthProvider, Next.js frontend).
Phase 1 — Core tree: complete. Media (upload/serve), soft-delete + recovery UI, full CRUD across entities, and the 4-level tree visibility/privacy model (#41–#51).
Phase 2 — substantially landed. GEDCOM import (preview→apply, duplicate-aware) and export (citation-preserving, #232); fuzzy name search (pg_trgm) + the public /explore directory. Living-person protection is still hardening.
Phase 4 — AI assistant foundations landed. Pluggable LLMProvider/EmbeddingProvider abstraction + multi-provider registry (Anthropic/OpenAI/xAI/Ollama, #235/#237), the ChangeProposal propose-then-confirm flow (#236), and per-tree AI model policy (#238). The assistant's tool surface that emits proposals is the remaining piece.
Also shipped: tree membership management (#233), an instance owner/operator role (OWNER_EMAIL, #240), a schema-drift readiness guard (#239), and a one-command operator backup (#234).
Not built yet: Phase 3 (Property — parcels/deeds/chain-of-title; no property models exist), Phase 5 (OIDC/social auth — only the AuthProvider ABC exists), and cross-tree hints (last; needs multiple populated trees + the embedding provider).

Where to start

The roadmap is phased in PRD §8. Build in dependency order. Phases 0 and 1 are complete, Phase 2 is substantially done, and Phase 4's AI foundations have shipped (see the status list above). The biggest unbuilt areas are Phase 3 (Property) and Phase 5 (OIDC/social auth), which makes them the likely next targets. For reference, Phase 0 covered:

Backend skeleton (FastAPI, async, layered) + Postgres + migrations
Core data model from ARCHITECTURE §5: start with User, Tree, TreeMembership, Person, Name, Relationship, Event, Place, Source, Citation, and AuditEntry, plus soft-delete support
Local auth (password + email verification) behind the AuthProvider interface
Frontend scaffold (Next.js) wired to the API via the generated client
The deploy stack: compose for app + postgres + objectstore, Caddy config, env-driven settings
CI/CD: Gitea Actions building images to the registry

Don't get ahead of the phases. GEDCOM and the assistant's propose-diff foundation (provider abstraction + ChangeProposal approval flow) have shipped; the remaining dependency-ordered work is Property (Phase 3, on top of the tested people graph), then richer collaboration/audit UI, with cross-tree hints last (they need multiple populated trees and the embedding provider). If you think the order is wrong, raise it rather than reordering silently.

Conventions

Sign off every commit with the DCO. Use git commit -s. Commits without a Signed-off-by line cannot be merged. See CONTRIBUTING.md.
Commit messages: concise summary line; body explaining why when it isn't obvious. One logical change per commit where practical.
Tests accompany new behavior once a test surface exists.
Docs travel with code: update PRD/ARCHITECTURE in the same change when scope or design shifts.
Privacy/assistant/hint code gets extra care — these are the areas where bugs do real harm. Prefer a design note before a large change.
No secrets in the repo. Config via env; provide .env.example with placeholders.
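The "config via env" convention can be sketched as a single settings loader that reads everything from the environment and fails fast on missing values. This is an illustration under stated assumptions: OWNER_EMAIL, COOKIE_NAME, and S3_PRESIGN_TTL are variable names that appear in the project's docs, while DATABASE_URL, the Settings class, load_settings, and both default values are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    database_url: str
    owner_email: str
    cookie_name: str
    s3_presign_ttl: int  # seconds


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from environment variables only; no hard-coded secrets."""

    def required(key: str) -> str:
        value = env.get(key, "").strip()
        if not value:
            raise RuntimeError(f"missing required environment variable: {key}")
        return value

    return Settings(
        database_url=required("DATABASE_URL"),    # assumed variable name
        owner_email=required("OWNER_EMAIL"),
        cookie_name=env.get("COOKIE_NAME", "session"),          # assumed default
        s3_presign_ttl=int(env.get("S3_PRESIGN_TTL", "900")),   # assumed default
    )
```

Passing the mapping in as a parameter makes the loader easy to test with a plain dict, and the matching .env.example would list the same keys with placeholder values.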

License & contribution terms

Provenance is source-available under BUSL-1.1 (see LICENSE): free for personal/family/non-commercial use, no third-party commercial hosting, and each release converts to AGPL-3.0 four years after it ships. The DCO sign-off keeps the licensing chain clean so the maintainer can manage that conversion and a possible future hosted offering. Don't add code under an incompatible license, and don't vendor dependencies whose licenses conflict with eventual AGPL distribution.

Brand

Visual identity lives in docs/brand/ (see its README for full guidance). Use these as the frontend's design tokens:

Ink (primary text/marks): #1A1A17 light / #F2EEE6 dark
Bronze (accent, constant): #A06A42
Paper (knockout on bronze, constant): #F7F3EC
Muted (secondary text): #6B6862 light / #9A968E dark

Wordmark is a serif (heritage register); UI body/secondary text is a humanist sans. Logo lockup: docs/brand/provenance-logo.svg; app icon/favicon: docs/brand/provenance-icon.svg and favicon.svg. Don't recolor outside the palette or add gradients/shadows — the look is flat and warm.

Owner & contact

Maintainer: Justin Paul (justin@jpaul.io). This deployment targets a home lab: Authentik at auth.jpaul.io for auth, mail.jpaul.io for SMTP, behind Caddy + Cloudflare Tunnel.

Open questions (don't assume answers)

Parked in PRD §11 and ARCHITECTURE §14: telemetry (opt-in anonymous vs none), embeddings provider for matching, DNA as future-phase vs permanent non-goal, native mobile timing, hosted-SaaS model, queue backend default (Postgres vs Redis), and PostGIS adoption. If a task depends on one of these, surface the dependency instead of picking silently.
