justin e5a8713293 Document core-model decisions in CLAUDE.md and ARCHITECTURE
Records the landed data model and backend layout, the Phase 0 tree-scoping of Place (vs. the eventual shared gazetteer), and the temporary X-User-Id auth shim.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Justin Paul <justin@jpaul.me>
2026-06-06 10:40:19 -04:00

CLAUDE.md

Operating guide for Claude Code (and any AI assistant) working in this repository. Read this first, then docs/PRD.md and docs/ARCHITECTURE.md.

What this project is

Provenance is self-hostable, source-available software for tracing where you come from — family and land. It combines a genealogy application (people, relationships, events, sources, media) with property chain-of-title tracking (parcels, deeds, ownership events), a privacy model, an AI research assistant, and a cross-tree hint system. It is multi-tenant and container-native.

The name is the thesis: provenance means a documented chain of custody. Every fact should link to where it came from.

Non-negotiable rules

These are product invariants, not preferences. Do not violate them, and flag any task that seems to require it:

  1. The AI assistant never writes autonomously. Assistant "write" operations emit a ChangeProposal (a structured diff) that a human approves, edits, or rejects in the UI. There must be no code path where a model response mutates tree data directly. This is structural — enforce it in the type system / service boundaries, not just by convention.
  2. Privacy has a single enforcement point. All reads — API, server-rendered public pages, search, hints, assistant — resolve visibility through one privacy engine in the service layer. Never add a query path that returns rows without passing through it.
  3. Living people are protected by default. Non-owners do not see PII for a person who is (or may be) living. See the living-person rule in ARCHITECTURE §6.
  4. Hint matching is anonymous until mutual consent. A match notification must reveal nothing identifying about the other user or any living person. Identities exchange only after both sides opt in.
  5. Sources are first-class. Don't model citations as free-text afterthoughts. A Source is a reusable entity; a Citation links it to a specific fact.
  6. Only legal data sources. Ship scrapers/connectors only for permissible sources (FamilySearch API, Find A Grave, WikiTree, BLM/GLO, USGS, public-domain newspapers, public county records). Never add connectors for paywalled/terms-prohibited sites (Ancestry, MyHeritage, 23andMe).
  7. Everything is configurable via environment. Auth, mail, object storage, database, model providers, scrapers — all twelve-factor. No hard-coded endpoints or keys.
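Rule 1 asks for structural enforcement, so a minimal sketch may help. Everything below is illustrative: the field names, `ProposalService`, and the dict standing in for tree storage are assumptions, not the landed model. The point it shows is that the assistant-side function receives no handle to tree data, so the only thing it can do is describe a change.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Any


class ProposalStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass(frozen=True)
class FieldChange:
    """One entry in the structured diff (hypothetical shape)."""
    entity_id: str
    field_name: str
    old_value: Any
    new_value: Any


@dataclass
class ChangeProposal:
    """What the assistant may produce: a diff, never a mutation."""
    proposed_by: str
    changes: list[FieldChange]
    status: ProposalStatus = ProposalStatus.PENDING


def propose_field_change(
    entity_id: str, field_name: str, old_value: Any, new_value: Any
) -> ChangeProposal:
    """Assistant-side entry point. It is given no tree handle,
    so it can only return a proposal."""
    change = FieldChange(entity_id, field_name, old_value, new_value)
    return ChangeProposal(proposed_by="assistant", changes=[change])


class ProposalService:
    """Hypothetical service boundary: the only code that applies a diff,
    and only when a human reviewer id is supplied."""

    def __init__(self, tree: dict[str, dict[str, Any]]) -> None:
        self._tree = tree  # stand-in for the repository layer

    def approve(self, proposal: ChangeProposal, reviewer_id: str) -> None:
        if not reviewer_id:
            raise PermissionError("a human reviewer is required")
        if proposal.status is not ProposalStatus.PENDING:
            raise ValueError("proposal already resolved")
        for change in proposal.changes:
            self._tree[change.entity_id][change.field_name] = change.new_value
        proposal.status = ProposalStatus.APPROVED

    def reject(self, proposal: ChangeProposal, reviewer_id: str) -> None:
        if not reviewer_id:
            raise PermissionError("a human reviewer is required")
        if proposal.status is not ProposalStatus.PENDING:
            raise ValueError("proposal already resolved")
        proposal.status = ProposalStatus.REJECTED
```

In a real service the reviewer id would come from the authenticated session, and an "edit" action would replace the proposal's changes before approval.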

Tech stack

  • Frontend: Next.js (App Router) + React + TypeScript + Tailwind + shadcn/ui. Mobile-first, server components for public/SEO pages, generated TS client from the backend OpenAPI spec.
  • Backend: Python + FastAPI, async, layered (API → service → repository → domain). SQLAlchemy. OpenAPI is the contract.
  • Worker: same image as backend in worker mode; queue-driven async jobs.
  • Database: PostgreSQL with pg_trgm (fuzzy search) and pgvector (match ranking).
  • Object storage: S3-compatible (MinIO for self-host).
  • Edge: Caddy reverse proxy; optional Cloudflare Tunnel (preferred ingress, never required).
  • Email: operator-configured SMTP.
  • CI/CD: Gitea Actions on git.jpaul.io builds container images and pushes them to the Gitea registry; servers pull to deploy.

Pick libraries consistent with this stack. If you introduce a significant dependency or a new service, note it in ARCHITECTURE.md in the same change.
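To make the layering concrete, here is a rough sketch of how the API → service → repository split can carry the single privacy enforcement point from rule 2. It uses plain dataclasses rather than SQLAlchemy models, and every class, field, and the redaction placeholder text are assumptions. The visibility check is a stand-in only: the actual living-person rule is defined in ARCHITECTURE §6, and tree memberships are ignored here.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Person:
    id: str
    tree_owner_id: str
    display_name: str
    birth_date: Optional[str]
    possibly_living: bool


class PersonRepository:
    """Repository layer: returns raw rows and knows nothing about visibility."""

    def __init__(self, rows: list[Person]) -> None:
        self._rows = {p.id: p for p in rows}

    def get(self, person_id: str) -> Person:
        return self._rows[person_id]


class PrivacyEngine:
    """The one place visibility is decided (placeholder rule only)."""

    def apply(self, person: Person, viewer_id: Optional[str]) -> Person:
        if viewer_id == person.tree_owner_id or not person.possibly_living:
            return person
        # Non-owner viewing a possibly-living person: strip PII.
        return replace(person, display_name="Living person", birth_date=None)


class PersonService:
    """Service layer: the only caller of the repository, and it always
    routes results through the privacy engine before returning them."""

    def __init__(self, repo: PersonRepository, privacy: PrivacyEngine) -> None:
        self._repo = repo
        self._privacy = privacy

    def get_person(self, person_id: str, viewer_id: Optional[str]) -> Person:
        return self._privacy.apply(self._repo.get(person_id), viewer_id)
```

API handlers, public pages, search, hints, and the assistant would all depend on `PersonService`, never on `PersonRepository` directly.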

Repository layout

/                  # docs and project meta (this file, README, LICENSE, COC, CONTRIBUTING)
/docs              # PRD.md, ARCHITECTURE.md
/backend           # FastAPI service (uv-managed). app/{api/v1, services (+ privacy engine), repositories, models, schemas, core}; migrations/ = Alembic
/deploy            # docker-compose.yml, Caddyfile, .env.example — the self-host stack
/.gitea/workflows  # Gitea Actions CI (build images → Gitea registry)
/frontend          # Next.js app — not yet scaffolded (Phase 0, after the deploy story)

Phase 0 is landing deploy-first: the compose stack (Postgres + MinIO + Caddy + a minimal FastAPI backend exposing /health and /health/ready) and CI come before the real data model and the frontend. Backend dependencies are managed with uv; migrations use Alembic. The core data model (ARCHITECTURE §5) and its initial migration have landed; local auth and the frontend are next. Until the auth slice lands, a temporary X-User-Id request header stands in for authentication. Keep this section current as the tree grows.
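For orientation, a header shim of this kind might look roughly like the following. This is a framework-agnostic sketch, not the code in /backend: the function name, the error type, and the assumption that user ids are UUIDs are all hypothetical. In FastAPI the same logic would typically sit in a dependency.

```python
import uuid
from typing import Mapping


class MissingUserError(Exception):
    """Raised when the shim header is absent or malformed (hypothetical)."""


def user_id_from_headers(headers: Mapping[str, str]) -> uuid.UUID:
    """Resolve the acting user from the temporary X-User-Id header.

    Assumes user ids are UUIDs; adjust if the model uses another key type.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    raw = normalised.get("x-user-id")
    if raw is None:
        raise MissingUserError("X-User-Id header is required")
    try:
        return uuid.UUID(raw.strip())
    except ValueError as exc:
        raise MissingUserError("X-User-Id must be a UUID") from exc
```

Because the header is trusted as-is, a shim like this is only safe on a development stack and should disappear when real auth lands.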

Where to start

The roadmap is phased in PRD §8. Build in dependency order. Phase 0 — Foundation is the current target:

  1. Backend skeleton (FastAPI, async, layered) + Postgres + migrations
  2. Core data model from ARCHITECTURE §5 — start with User, Tree, TreeMembership, Person, Name, Relationship, Event, Place, Source, Citation, AuditEntry, soft-delete support
  3. Local auth (password + email verification) behind the AuthProvider interface
  4. Frontend scaffold (Next.js) wired to the API via the generated client
  5. The deploy stack: compose for app + postgres + objectstore, Caddy config, env-driven settings
  6. CI/CD: Gitea Actions building images to the registry
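Step 3 puts local auth behind an AuthProvider interface. One possible shape, as a sketch under assumptions: the method names, the in-memory user store, and the PBKDF2 parameters are illustrative and not taken from ARCHITECTURE.md. The idea is that a later provider (for example an OIDC-backed one) can satisfy the same protocol without touching callers.

```python
import hashlib
import hmac
import os
from dataclasses import dataclass
from typing import Optional, Protocol

PBKDF2_ITERATIONS = 100_000  # illustrative; tune for production hardware


class AuthProvider(Protocol):
    def authenticate(self, email: str, password: str) -> Optional[str]:
        """Return the user id on success, None otherwise."""
        ...


@dataclass
class LocalUser:
    id: str
    email: str
    salt: bytes
    password_hash: bytes
    email_verified: bool = False


def hash_password(password: str, salt: bytes) -> bytes:
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, PBKDF2_ITERATIONS
    )


class LocalAuthProvider:
    """Password + email verification, backed by an in-memory dict
    that stands in for the user repository."""

    def __init__(self) -> None:
        self._users: dict[str, LocalUser] = {}

    def register(self, user_id: str, email: str, password: str) -> LocalUser:
        salt = os.urandom(16)
        user = LocalUser(user_id, email.lower(), salt, hash_password(password, salt))
        self._users[user.email] = user
        return user

    def verify_email(self, email: str) -> None:
        self._users[email.lower()].email_verified = True

    def authenticate(self, email: str, password: str) -> Optional[str]:
        user = self._users.get(email.lower())
        if user is None or not user.email_verified:
            return None
        candidate = hash_password(password, user.salt)
        # Constant-time comparison avoids leaking hash prefixes via timing.
        if hmac.compare_digest(candidate, user.password_hash):
            return user.id
        return None
```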

Don't get ahead of the phases. GEDCOM lands before the assistant (so AI writes target a stable model); property follows a tested people graph; hints come last because they need multiple populated trees. If you think the order is wrong, raise it rather than reordering silently.

Conventions

  • Sign off every commit with the DCO. Use git commit -s. Commits without a Signed-off-by line cannot be merged. See CONTRIBUTING.md.
  • Commit messages: concise summary line; body explaining why when it isn't obvious. One logical change per commit where practical.
  • Tests accompany new behavior once a test surface exists.
  • Docs travel with code: update PRD/ARCHITECTURE in the same change when scope or design shifts.
  • Privacy/assistant/hint code gets extra care — these are the areas where bugs do real harm. Prefer a design note before a large change.
  • No secrets in the repo. Config via env; provide .env.example with placeholders.
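As an illustration of env-driven config, settings can be loaded once at startup and fail fast when something required is missing. The variable names below (DATABASE_URL, S3_ENDPOINT, SMTP_HOST, SMTP_PORT) and the default port are placeholders chosen for this sketch; the authoritative list is whatever deploy/.env.example defines.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    database_url: str
    s3_endpoint: str
    smtp_host: str
    smtp_port: int

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Settings":
        def require(name: str) -> str:
            # Treat unset and empty values the same: both are misconfiguration.
            value = env.get(name)
            if not value:
                raise RuntimeError(f"missing required setting: {name}")
            return value

        return cls(
            database_url=require("DATABASE_URL"),
            s3_endpoint=require("S3_ENDPOINT"),
            smtp_host=require("SMTP_HOST"),
            smtp_port=int(env.get("SMTP_PORT", "587")),
        )
```

Passing a mapping instead of reading `os.environ` directly keeps the loader easy to test without touching the process environment.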

License & contribution terms

Provenance is source-available under BUSL-1.1 (see LICENSE): free for personal/family/non-commercial use, no third-party commercial hosting, and each release converts to AGPL-3.0 four years after it ships. The DCO sign-off keeps the licensing chain clean so the maintainer can manage that conversion and a possible future hosted offering. Don't add code under an incompatible license, and don't vendor dependencies whose licenses conflict with eventual AGPL distribution.

Owner & contact

Maintainer: Justin Paul (justin@jpaul.io). This deployment targets a home lab: Authentik at auth.jpaul.io for auth, mail.jpaul.io for SMTP, behind Caddy + Cloudflare Tunnel.

Open questions (don't assume answers)

Parked in PRD §11 and ARCHITECTURE §14: telemetry (opt-in anonymous vs none), embeddings provider for matching, DNA as future-phase vs permanent non-goal, native mobile timing, hosted-SaaS model, queue backend default (Postgres vs Redis), and PostGIS adoption. If a task depends on one of these, surface the dependency instead of picking silently.