justin 447daf7fa8 docs: bring all documentation current with shipped work
A multi-agent audit of every doc against the code surfaced ~50 stale/missing
items (the roadmap/status docs and the backlog had fallen behind the code).
This catches them up:

- CLAUDE.md: phase status was ~3 phases stale ("Phase 1 is next" while Phase 1 +
  chunks of 2 & 4 shipped). Rewrote the status list; added a model-provider
  tech-stack entry; updated repo-layout (integrations objectstore/models,
  deploy backup.sh/dev compose).
- ARCHITECTURE.md: §6 privacy engine described 3 visibility levels — corrected to
  the shipped 4 (adds site_members); documented per-tree AI policy on Tree,
  LLMProvider/EmbeddingProvider split + registry, ChangeProposal origin/status/
  operations, verified-email session gate, instance-owner role, schema-drift
  guard, and the env_file config model.
- PRD.md: 4-level visibility in US-040/§5.5, instance-owner role (§5.1/§5.11),
  per-tree AI policy (§5.8), §8 sequencing annotated with shipped status, header
  date/status bumped.
- README.md: 4-level privacy; softened "Full GEDCOM 7" to the 5.5.1/7 common
  subset; noted backups + instance-owner admin; moved property/land to an
  explicit "where it's headed" (no property models exist yet).
- BACKLOG.md: flipped ~15 shipped-but-open rows to Have (ChangeProposal, provider
  abstraction, GEDCOM citation export, membership management, operator backup,
  email-verification gate, per-tree AI policy, instance owner, the whole
  visibility/public-viewing/child-resource-redaction cluster #41-#51/#46), and
  reconciled the executive summary, "current defects" list, quick wins, and
  differentiators. Left genuinely-open items (citation/source redaction, sitemap,
  per-tree noindex, scoped-token API) accurately open.
- .env.example: dropped "SMTP wired in a later phase"; documented the worker
  purge knobs, S3_PRESIGN_TTL, COOKIE_NAME; removed a stray duplicate line.
- design/: tree-visibility.md and change-proposal.md marked Shipped; corrected
  the redaction approach (reuses member schemas, not a separate PublicPersonRead)
  and the apply() rollback claim (v1 is not cross-op transactional), and marked
  rate-limiting/sitemap/noindex as deferred.

No code changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Justin Paul <justin@jpaul.me>
2026-06-10 21:05:29 -04:00

# Design note: tree visibility & the public viewing surface
Status: **Shipped (#41-#51)**. Owner: Justin. Created 2026-06-09.
This is a privacy-critical change (it created the first anonymous read surface in
Provenance). Per CLAUDE.md, it was designed before code and shipped in small,
individually-reviewable PRs, with tests on the privacy engine and the public read
path landing before any anonymous endpoint was exposed.
## 1. The model
Visibility flattens **two axes** (*who may read* and *how discoverable*) into one
ordered enum for the UI:
| Level | Anonymous (no login) | Any logged-in user | Tree members | In-app directory | Search-indexed |
|---|---|---|---|---|---|
| `public` — anyone on the web | ✅ view¹ | ✅ view¹ | ✅ full | ✅ listed to everyone | ✅ sitemap + indexable |
| `site_members` — any signed-in user on this site | ❌ | ✅ view¹ | ✅ full | ✅ listed to logged-in users | ❌ (`noindex`) |
| `unlisted` — anyone with the link | ✅ via direct link¹ | ✅ via link¹ | ✅ full | ❌ never listed | ❌ (`noindex`) |
| `private` | ❌ | ❌ | ✅ full | ❌ | ❌ |
¹ **Every non-member view passes through the privacy engine.** Living people are
redacted, and per-person `private` hides / `public` reveals, exactly as
`person_visibility()` already does (`backend/app/services/privacy.py:100-110`).
This is the single enforcement point — no public code path may issue a raw query.
Decisions captured (2026-06-09):
- **Unlisted** = anyone with the link, no account required. The link must be
**unguessable** (the tree UUID is already non-enumerable; do not add a public
integer id). Unlisted trees are excluded from the directory and sitemap and
served `noindex`.
- **Public** discovery for v1 includes **an in-app public browse/search**, not
just search-engine indexing.
- **Site members** = *any* registered account on this instance (not an
invite list — that is already tree membership / `private`).
## 2. Data model
`TreeVisibility` enum (`backend/app/models/enums.py`) gains a value:
```
public # anyone on the web
site_members # any authenticated user of this instance <-- NEW
unlisted # anyone with the link
private # members only (default)
```
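- Alembic migration to `ALTER TYPE tree_visibility ADD VALUE 'site_members'`.
Before PostgreSQL 12 this statement cannot run inside a transaction block at
all; from 12 on it can, but the new value is unusable until that transaction
commits. Either way, keep it in its own migration and run it via `op.execute`
in autocommit mode (Alembic's `autocommit_block()`).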
- Default stays `private`. Existing rows unchanged.
- `TreeRead`/`TreeUpdate`/`TreeCreate` schemas already carry the enum; they pick
up the new value automatically. The OpenAPI client regen (`gen:api`) exposes it
to the frontend.
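
As a rough illustration of the four-level enum, here is a minimal sketch. It assumes a string-valued Python `Enum` whose values match the Postgres labels; the real class in `backend/app/models/enums.py` may be declared differently, and `DEFAULT_VISIBILITY` is a name introduced only for this example.

```python
from enum import Enum


class TreeVisibility(str, Enum):
    """Sketch only: assumes string values that match the Postgres enum labels."""

    public = "public"              # anyone on the web
    site_members = "site_members"  # any authenticated user of this instance
    unlisted = "unlisted"          # anyone with the link
    private = "private"            # members only (default)


# Hypothetical constant: new trees stay private unless the owner changes it.
DEFAULT_VISIBILITY = TreeVisibility.private

# A label read back from the database or a request body parses by value.
parsed = TreeVisibility("site_members")
```

Because lookup is by value, an unknown label raises `ValueError` rather than silently mapping to a more open level.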
## 3. Privacy engine
`can_view_tree()` today treats `public` and `unlisted` identically and ignores
whether the viewer is anonymous vs authenticated (`privacy.py:44-49`). Replace the
final line with explicit branching on viewer auth state:
```
if membership:
    return True  # members always
match tree.visibility:
    case TreeVisibility.public | TreeVisibility.unlisted:
        return True  # anonymous OK (unlisted gated only by knowing the link)
    case TreeVisibility.site_members:
        return user_id is not None  # any logged-in account
    case TreeVisibility.private:
        return False
```
`person_visibility()` is unchanged — it already redacts living/private people for
non-members. Add focused unit tests: anonymous + each visibility; living person
redacted on public/unlisted; `site_members` denies anonymous but allows a
logged-in non-member; `private` denies both.
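
The tree-level part of that test matrix could be sketched as follows. This is a simplified, self-contained stand-in, not the code in `privacy.py`: the three-argument signature, the string user id, and the `check_matrix` helper are assumptions made so the example runs on its own.

```python
from enum import Enum
from typing import List, Optional


class TreeVisibility(str, Enum):
    public = "public"
    site_members = "site_members"
    unlisted = "unlisted"
    private = "private"


def can_view_tree(visibility: TreeVisibility, user_id: Optional[str], is_member: bool) -> bool:
    """Simplified stand-in; the real function takes a tree and a membership row."""
    if is_member:
        return True  # members always
    if visibility in (TreeVisibility.public, TreeVisibility.unlisted):
        return True  # anonymous OK; unlisted is gated only by knowing the link
    if visibility is TreeVisibility.site_members:
        return user_id is not None  # any logged-in account
    return False  # private, and anything unrecognised, denies by default


# Expected outcome for NON-members: (visibility, logged_in) -> allowed
EXPECTED = {
    (TreeVisibility.public, False): True,
    (TreeVisibility.public, True): True,
    (TreeVisibility.unlisted, False): True,
    (TreeVisibility.unlisted, True): True,
    (TreeVisibility.site_members, False): False,
    (TreeVisibility.site_members, True): True,
    (TreeVisibility.private, False): False,
    (TreeVisibility.private, True): False,
}


def check_matrix() -> List[str]:
    """Return a description of every cell that disagrees with EXPECTED."""
    failures = []
    for (visibility, logged_in), allowed in EXPECTED.items():
        user_id = "user-1" if logged_in else None
        if can_view_tree(visibility, user_id, is_member=False) != allowed:
            failures.append(f"{visibility.value} logged_in={logged_in}")
    for visibility in TreeVisibility:
        if not can_view_tree(visibility, "user-1", is_member=True):
            failures.append(f"{visibility.value} member denied")
    return failures
```

Spelling the matrix out as data keeps every cell visible in review, so a newly added level has to be given an explicit expectation rather than inheriting one by accident.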
## 4. The anonymous read path (the careful part)
**Shipped: a dedicated read-only public API namespace**, not optional-auth on the
existing endpoints. Rationale: it is far easier to audit a small, purpose-built
surface that *always* funnels through `person_visibility` than to weaken the
membership checks on the authenticated endpoints and hope every branch is covered.
- Router `app/api/v1/public.py`, mounted at `/api/v1/public`, with an
**optional-auth** dependency `CurrentUserOrNone` (returns `User | None`; never
401s). Contrast with `CurrentUser` (`deps.py:30-36`) which hard-401s.
- Endpoints (read-only; no create/update/delete):
  - `GET /public/trees`: the directory. Lists `public` to everyone; additionally
    lists `site_members` when the caller is authenticated. Paginated, search via
    existing `pg_trgm`. Never lists `unlisted`/`private`.
  - `GET /public/trees/{id}`: tree metadata if `can_view_tree(user_or_none)`.
  - `GET /public/trees/{id}/persons`, `/persons/{pid}`, `/persons/{pid}/names`,
    `/relationships`, `/events`: each filtered through `person_visibility`.
    (Media is not exposed on the public surface yet; it is deferred.)
- **Redaction happens in the service, before serialization** — this is the safety
guarantee. It did **not** ship as a separate `PublicPersonRead` schema (that
recommendation was not adopted): the public router **reuses the member read
schemas** (`PersonRead`, `RelationshipRead`, `EventRead`, `NameRead`), and only
the tree projection (`PublicTreeRead`) is distinct. Safety comes from
`public_view_service` resolving `person_visibility` and then **dropping hidden
rows and redacting possibly-living people** (`person_service._redact` rewrites
the name to "Living person", etc.) *before* a row is ever validated into a
schema. No route hands a raw row to the serializer.
- **Rate limiting** on the public namespace (per-IP) is **deferred** — it is not
implemented in the app and may be handled at the Caddy edge if needed.
- **Audit**: count public reads; do not log PII.
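
The "redact in the service, before serialization" guarantee can be illustrated with a small sketch. Everything here is hypothetical and simplified: `PersonRow`, the three-valued `PersonVisibility`, the toy rules inside `person_visibility`, and the choice to blank the birth date are assumptions for the example, not the shipped `public_view_service` or `person_service._redact`.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Iterable, List, Optional


class PersonVisibility(Enum):
    """Hypothetical decision type for a non-member viewer."""
    visible = "visible"
    redacted = "redacted"  # possibly living: show a placeholder
    hidden = "hidden"      # drop the row entirely


@dataclass(frozen=True)
class PersonRow:
    id: str
    display_name: str
    birth_date: Optional[str]
    is_living: bool
    is_private: bool


def person_visibility(row: PersonRow) -> PersonVisibility:
    """Toy rules for a non-member; the real engine lives in privacy.py."""
    if row.is_private:
        return PersonVisibility.hidden
    if row.is_living:
        return PersonVisibility.redacted
    return PersonVisibility.visible


def redact(row: PersonRow) -> PersonRow:
    """Return a copy with identifying fields replaced (assumed field set)."""
    return replace(row, display_name="Living person", birth_date=None)


def public_persons(rows: Iterable[PersonRow]) -> List[PersonRow]:
    """Drop hidden rows and redact living ones before anything is serialized."""
    result = []
    for row in rows:
        decision = person_visibility(row)
        if decision is PersonVisibility.hidden:
            continue
        result.append(redact(row) if decision is PersonVisibility.redacted else row)
    return result
```

The point of the shape is that the serializer only ever receives the output of `public_persons`, so reusing the member read schemas cannot reintroduce a hidden name.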
## 5. Frontend public pages
- New **server-rendered** routes outside the authed app shell, e.g.
`/p/[treeId]` (tree), `/p/[treeId]/[personId]` (person), `/explore` (directory).
Server components fetch the `/api/v1/public/*` endpoints; no login redirect.
- `robots`: ships a coarse `allow: ["/", "/p/"]` rule (`frontend/app/robots.ts`)
that keeps the authed app out of the index. Per-tree `noindex, nofollow` meta
for `unlisted`/`site_members` and a `public`-only **sitemap** did **not** ship —
both are **deferred** follow-ups (per-tree noindex needs server rendering;
meanwhile `unlisted`/`site_members` trees aren't linked or listed, so they
aren't crawl-discoverable).
- The directory `/explore` is anonymous for `public`; shows `site_members` trees
only to logged-in users.
- Reuse the tree/person view components where possible, fed by the redacted
schema.
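
The directory rule (anonymous callers see `public` trees; logged-in callers also see `site_members` trees; `unlisted` and `private` are never listed) might look like this in miniature. `TreeSummary`, `listable_visibilities`, and `directory` are hypothetical names, and the real endpoint would presumably apply the same filter in its database query rather than in Python.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class TreeSummary:
    id: str
    name: str
    visibility: str  # "public", "site_members", "unlisted" or "private"


def listable_visibilities(user_id: Optional[str]) -> FrozenSet[str]:
    """Visibility levels that may appear in the directory for this caller."""
    if user_id is None:
        return frozenset({"public"})
    return frozenset({"public", "site_members"})


def directory(trees: Iterable[TreeSummary], user_id: Optional[str]) -> List[TreeSummary]:
    """Filter to listable trees; unlisted and private never match either set."""
    allowed = listable_visibilities(user_id)
    return [tree for tree in trees if tree.visibility in allowed]
```

Using an allow-list of levels, rather than excluding `unlisted` and `private`, means a level added later stays out of the directory until someone opts it in.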
## 6. UI control
Update the visibility dropdown (`frontend/app/trees/page.tsx`, shipped in PR #41)
from 3 to 4 options with helper text:
```
Private — only you and people you invite
Site members — any signed-in user on this site
Unlisted — anyone with the link (not listed or indexed)
Public — anyone on the web; listed and search-indexable
```
A short confirmation when switching *to* `public` ("This makes `<tree>` visible to
anyone on the web. Living people stay hidden.") is worthwhile given the stakes.
## 7. Guardrails / invariants
- One enforcement point: every public response is built from `person_visibility`
output. No raw repository reads in the public router.
- Living-person protection holds regardless of tree visibility.
- Unlisted relies on UUID unguessability; never expose a sequential public id.
- Per-tree `noindex` (everything except `public`) and a `public`-only sitemap are
**deferred** (see §5); today `robots.ts` keeps the authed app out of the index
and `unlisted`/`site_members` trees aren't linked or listed.
- Tests gate the merge: privacy-engine matrix + an integration test that hits the
public endpoints anonymously and asserts no living-person PII leaks.
## 8. Suggested phasing (small PRs)
1. Enum value + migration + regen client (+ dropdown → 4 options). No behavior
change yet for non-members.
2. Privacy-engine branching + unit tests.
3. Public read API namespace (optional-auth, redaction in the service) + tests.
Rate limiting was deferred (see §4).
4. Public frontend pages (`/p/...`) + robots. The sitemap was deferred (see §5).
5. In-app `/explore` directory + search.
Steps 2-3 are the privacy-critical core and should be reviewed hardest.
## 9. Open questions
- Caching: public pages are cacheable for SEO, but cache keys must not blur the
redacted-vs-member rendering. Likely: cache only the anonymous projection at the
edge; never cache member responses.
- Do `site_members` trees appear in the sitemap for logged-in crawling? (Default:
no — `noindex`.)
- Per-tree opt-out of the directory even when `public`? (Probably unnecessary;
`unlisted` already covers "reachable but not listed.")
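
One possible shape for the caching answer, offered only as a sketch of the "cache the anonymous projection, never member responses" idea: the `cache_control` helper, the five-minute TTL, and the decision to keep anonymous `unlisted` views out of shared caches are all assumptions, not settled design.

```python
from typing import Optional


def cache_control(visibility: str, user_id: Optional[str]) -> str:
    """Pick a Cache-Control value so shared caches only hold anonymous public views."""
    if user_id is None and visibility == "public":
        # Anonymous projection of a public tree: safe for an edge cache.
        # The TTL is an arbitrary placeholder.
        return "public, max-age=300"
    # Any authenticated response, and anything not fully public, stays uncached.
    return "private, no-store"
```

Keying the decision on both the viewer and the tree level keeps a member's unredacted rendering from ever being stored under a URL that anonymous visitors also request.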