provenance/docs/design/tree-visibility.md
justin 447daf7fa8 docs: bring all documentation current with shipped work
A multi-agent audit of every doc against the code surfaced ~50 stale/missing
items (the roadmap/status docs and the backlog had fallen behind the code).
This catches them up:

- CLAUDE.md: phase status was ~3 phases stale ("Phase 1 is next" while Phase 1 +
  chunks of 2 & 4 shipped). Rewrote the status list; added a model-provider
  tech-stack entry; updated repo-layout (integrations objectstore/models,
  deploy backup.sh/dev compose).
- ARCHITECTURE.md: §6 privacy engine described 3 visibility levels — corrected to
  the shipped 4 (adds site_members); documented per-tree AI policy on Tree,
  LLMProvider/EmbeddingProvider split + registry, ChangeProposal origin/status/
  operations, verified-email session gate, instance-owner role, schema-drift
  guard, and the env_file config model.
- PRD.md: 4-level visibility in US-040/§5.5, instance-owner role (§5.1/§5.11),
  per-tree AI policy (§5.8), §8 sequencing annotated with shipped status, header
  date/status bumped.
- README.md: 4-level privacy; softened "Full GEDCOM 7" to the 5.5.1/7 common
  subset; noted backups + instance-owner admin; moved property/land to an
  explicit "where it's headed" (no property models exist yet).
- BACKLOG.md: flipped ~15 shipped-but-open rows to Have (ChangeProposal, provider
  abstraction, GEDCOM citation export, membership management, operator backup,
  email-verification gate, per-tree AI policy, instance owner, the whole
  visibility/public-viewing/child-resource-redaction cluster #41-#51/#46), and
  reconciled the executive summary, "current defects" list, quick wins, and
  differentiators. Left genuinely-open items (citation/source redaction, sitemap,
  per-tree noindex, scoped-token API) accurately open.
- .env.example: dropped "SMTP wired in a later phase"; documented the worker
  purge knobs, S3_PRESIGN_TTL, COOKIE_NAME; removed a stray duplicate line.
- design/: tree-visibility.md and change-proposal.md marked Shipped; corrected
  the redaction approach (reuses member schemas, not a separate PublicPersonRead)
  and the apply() rollback claim (v1 is not cross-op transactional), and marked
  rate-limiting/sitemap/noindex as deferred.

No code changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Justin Paul <justin@jpaul.me>
2026-06-10 21:05:29 -04:00


Design note: tree visibility & the public viewing surface

Status: Shipped (#41-#51). Owner: Justin. Created 2026-06-09.

This is a privacy-critical change (it created the first anonymous read surface in Provenance). Per CLAUDE.md, it was designed before code and shipped in small, individually-reviewable PRs, with tests on the privacy engine and the public read path landing before any anonymous endpoint was exposed.

1. The model

Visibility flattens two axes (who may read, and how discoverable the tree is) into one ordered enum for the UI:

| Level | Anonymous (no login) | Any logged-in user | Tree members | In-app directory | Search-indexed |
|---|---|---|---|---|---|
| public (anyone on the web) | view¹ | view¹ | full | listed to everyone | sitemap + indexable |
| site_members (Public, Site Members) | no | view¹ | full | listed to logged-in users | no (noindex) |
| unlisted (anyone with the link) | via direct link¹ | via link¹ | full | never listed | no (noindex) |
| private | no | no | full | no | no |

¹ Every non-member view passes through the privacy engine. Living people are redacted, and per-person private hides / public reveals, exactly as person_visibility() already does (backend/app/services/privacy.py:100-110). This is the single enforcement point — no public code path may issue a raw query.

Decisions captured (2026-06-09):

  • Unlisted = anyone with the link, no account required. The link must be unguessable (the tree UUID is already non-enumerable; do not add a public integer id). Unlisted trees are excluded from the directory and sitemap and served noindex.
  • Public discovery for v1 includes an in-app public browse/search, not just search-engine indexing.
  • Public Site Members = any registered account on this instance (not an invite list — that is already tree membership / private).

2. Data model

TreeVisibility enum (backend/app/models/enums.py) gains a value:

public        # anyone on the web
site_members  # any authenticated user of this instance   <-- NEW
unlisted      # anyone with the link
private       # members only (default)
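  • Alembic migration to ALTER TYPE tree_visibility ADD VALUE 'site_members'. Before PostgreSQL 12, ADD VALUE cannot run inside a transaction block at all; from 12 onward it can, but the new value cannot be used until that transaction commits. Either way, keep it in its own migration and run it outside the migration transaction (for example, op.execute inside Alembic's autocommit_block()).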
  • Default stays private. Existing rows unchanged.
  • TreeRead/TreeUpdate/TreeCreate schemas already carry the enum; they pick up the new value automatically. The OpenAPI client regen (gen:api) exposes it to the frontend.

3. Privacy engine

can_view_tree() today treats public and unlisted identically and ignores whether the viewer is anonymous vs authenticated (privacy.py:44-49). Replace the final line with explicit branching on viewer auth state:

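if membership:
    return True  # members can always view
match tree.visibility:
    case TreeVisibility.public | TreeVisibility.unlisted:
        return True  # anonymous OK (unlisted is gated only by knowing the link)
    case TreeVisibility.site_members:
        return user_id is not None  # any logged-in account
    case _:
        return False  # private: members only (also the fail-closed default)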

person_visibility() is unchanged — it already redacts living/private people for non-members. Add focused unit tests: anonymous + each visibility; living person redacted on public/unlisted; site_members denies anonymous but allows a logged-in non-member; private denies both.
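The branching and the test matrix described above can be sketched as one self-contained module. This is an illustration, not the shipped privacy.py: the Tree dataclass, the is_member flag, the integer user id, and the matrix_failures helper are all assumptions made so the example runs on its own.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TreeVisibility(str, Enum):
    public = "public"
    site_members = "site_members"
    unlisted = "unlisted"
    private = "private"


@dataclass(frozen=True)
class Tree:
    # Hypothetical stand-in for the real Tree model.
    visibility: TreeVisibility


def can_view_tree(tree: Tree, user_id: Optional[int], is_member: bool) -> bool:
    """Tree-level gate only; person-level redaction is a separate step."""
    if is_member:
        return True
    if tree.visibility in (TreeVisibility.public, TreeVisibility.unlisted):
        return True  # anonymous OK; unlisted is gated only by knowing the link
    if tree.visibility is TreeVisibility.site_members:
        return user_id is not None  # any logged-in account
    return False  # private, and fail closed on anything unexpected


# visibility -> (anonymous allowed, logged-in non-member allowed)
EXPECTED = {
    TreeVisibility.public: (True, True),
    TreeVisibility.unlisted: (True, True),
    TreeVisibility.site_members: (False, True),
    TreeVisibility.private: (False, False),
}


def matrix_failures() -> list:
    """Return every (visibility, viewer) pair whose result differs from EXPECTED."""
    failures = []
    for visibility, (anon_ok, user_ok) in EXPECTED.items():
        tree = Tree(visibility)
        if can_view_tree(tree, None, False) != anon_ok:
            failures.append((visibility.value, "anonymous"))
        if can_view_tree(tree, 42, False) != user_ok:
            failures.append((visibility.value, "logged-in non-member"))
        if not can_view_tree(tree, 42, True):
            failures.append((visibility.value, "member"))
    return failures
```

In a real test suite each row of EXPECTED would more likely be a parametrized test case; the single helper just keeps the whole matrix visible in one place.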

4. The anonymous read path (the careful part)

Shipped: a dedicated read-only public API namespace, not optional-auth on the existing endpoints. Rationale: it is far easier to audit a small, purpose-built surface that always funnels through person_visibility than to weaken the membership checks on the authenticated endpoints and hope every branch is covered.

  • Router app/api/v1/public.py, mounted at /api/v1/public, with an optional-auth dependency CurrentUserOrNone (returns User | None; never 401s). Contrast with CurrentUser (deps.py:30-36) which hard-401s.
  • Endpoints (read-only; no create/update/delete):
    • GET /public/trees — directory: lists public to everyone; additionally lists site_members when the caller is authenticated. Paginated, search via existing pg_trgm. Never lists unlisted/private.
    • GET /public/trees/{id} — tree metadata if can_view_tree(user_or_none).
    • GET /public/trees/{id}/persons, /persons/{pid}, /persons/{pid}/names, /relationships, /events — each filtered through person_visibility. (Media is not exposed on the public surface yet — deferred.)
  • Redaction happens in the service, before serialization — this is the safety guarantee. It did not ship as a separate PublicPersonRead schema (that recommendation was not adopted): the public router reuses the member read schemas (PersonRead, RelationshipRead, EventRead, NameRead), and only the tree projection (PublicTreeRead) is distinct. Safety comes from public_view_service resolving person_visibility and then dropping hidden rows and redacting possibly-living people (person_service._redact rewrites the name to "Living person", etc.) before a row is ever validated into a schema. No route hands a raw row to the serializer.
  • Rate limiting on the public namespace (per-IP) is deferred — it is not implemented in the app and may be handled at the Caddy edge if needed.
  • Audit: count public reads; do not log PII.
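The "redact in the service, before serialization" guarantee can be illustrated with a small sketch. Everything here is a hypothetical stand-in: PersonRow, the override field, and this simplified person_visibility are not the real models or the real privacy engine, and the actual public_view_service may be structured differently. The point is only the ordering: hidden rows are dropped and possibly-living rows are rewritten before anything reaches a serializer.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Iterable, List, Optional


class PersonVisibility(Enum):
    VISIBLE = "visible"
    REDACTED = "redacted"  # shown as a placeholder
    HIDDEN = "hidden"      # dropped entirely


@dataclass(frozen=True)
class PersonRow:
    # Hypothetical row shape; the real model has more fields.
    id: int
    name: str
    birth_date: Optional[str]
    possibly_living: bool
    override: Optional[str] = None  # "private", "public", or None


def person_visibility(row: PersonRow) -> PersonVisibility:
    """Simplified stand-in: per-person private hides, public reveals."""
    if row.override == "private":
        return PersonVisibility.HIDDEN
    if row.override == "public":
        return PersonVisibility.VISIBLE
    if row.possibly_living:
        return PersonVisibility.REDACTED
    return PersonVisibility.VISIBLE


def redact(row: PersonRow) -> PersonRow:
    """Replace identifying fields with placeholders; keep the id for linking."""
    return replace(row, name="Living person", birth_date=None)


def public_person_rows(rows: Iterable[PersonRow]) -> List[PersonRow]:
    """Return only rows that are safe to hand to a read schema."""
    safe = []
    for row in rows:
        visibility = person_visibility(row)
        if visibility is PersonVisibility.HIDDEN:
            continue
        if visibility is PersonVisibility.REDACTED:
            row = redact(row)
        safe.append(row)
    return safe
```

Because the output has the same shape as the input, the member read schemas can validate it unchanged, which is what allows the public router to reuse them.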

5. Frontend public pages

  • New server-rendered routes outside the authed app shell, e.g. /p/[treeId] (tree), /p/[treeId]/[personId] (person), /explore (directory). Server components fetch the /api/v1/public/* endpoints; no login redirect.
  • robots: ships a coarse allow: ["/", "/p/"] rule (frontend/app/robots.ts) that keeps the authed app out of the index. Per-tree noindex, nofollow meta for unlisted/site_members and a public-only sitemap did not ship — both are deferred follow-ups (per-tree noindex needs server rendering; meanwhile unlisted/site_members trees aren't linked or listed, so they aren't crawl-discoverable).
  • The directory /explore is anonymous for public; shows site_members trees only to logged-in users.
  • Reuse the tree/person view components where possible, fed by the already-redacted public API responses.
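The listing rule behind /explore differs from the view rule: an unlisted tree is viewable by link but never listed. A minimal sketch of that filter, assuming a hypothetical TreeSummary shape and an in-memory list (the real endpoint would apply the same condition in its paginated query):

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class TreeSummary:
    # Hypothetical projection used only for this example.
    id: str
    name: str
    visibility: str  # "public", "site_members", "unlisted", or "private"


def listable_trees(trees: Iterable[TreeSummary], user_id: Optional[int]) -> List[TreeSummary]:
    """Directory rule: public for everyone, site_members only when logged in."""
    allowed = {"public"}
    if user_id is not None:
        allowed.add("site_members")
    # unlisted and private are never listed, whoever is asking
    return [tree for tree in trees if tree.visibility in allowed]
```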

6. UI control

Update the visibility dropdown (frontend/app/trees/page.tsx, shipped in PR #41) from 3 to 4 options with helper text:

Private          — only you and people you invite
Public  Members — any signed-in user on this site
Unlisted         — anyone with the link (not listed or indexed)
Public           — anyone on the web; listed and search-indexable

A short confirmation when switching to public ("This makes this tree visible to anyone on the web. Living people stay hidden.") is worthwhile given the stakes.

7. Guardrails / invariants

  • One enforcement point: every public response is built from person_visibility output. No raw repository reads in the public router.
  • Living-person protection holds regardless of tree visibility.
  • Unlisted relies on UUID unguessability; never expose a sequential public id.
  • Per-tree noindex (everything except public) and a public-only sitemap are deferred (see §5); today robots.ts keeps the authed app out of the index and unlisted/site_members trees aren't linked or listed.
  • Tests gate the merge: privacy-engine matrix + an integration test that hits the public endpoints anonymously and asserts no living-person PII leaks.
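One way the "no living-person PII leaks" assertion could be written is a recursive scan of the decoded JSON response for strings that must never appear, such as the real names and birth dates of the living people in the test fixture. The find_leaks helper and its path format are assumptions for illustration, not part of the existing test suite.

```python
from typing import Any, Iterable, List


def find_leaks(payload: Any, forbidden: Iterable[str], path: str = "$") -> List[str]:
    """Return the paths of string values that contain any forbidden substring."""
    needles = [needle.lower() for needle in forbidden]
    if isinstance(payload, dict):
        hits = []
        for key, value in payload.items():
            hits.extend(find_leaks(value, needles, f"{path}.{key}"))
        return hits
    if isinstance(payload, list):
        hits = []
        for index, item in enumerate(payload):
            hits.extend(find_leaks(item, needles, f"{path}[{index}]"))
        return hits
    if isinstance(payload, str):
        text = payload.lower()
        return [path] if any(needle in text for needle in needles) else []
    return []  # numbers, booleans, and None cannot carry a name
```

An integration test would call each public endpoint without credentials and assert that find_leaks(response_json, living_people_strings) returns an empty list; the returned paths make a failure easy to locate.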

8. Suggested phasing (small PRs)

  1. Enum value + migration + regen client (+ dropdown → 4 options). No behavior change yet for non-members.
  2. Privacy-engine branching + unit tests.
  3. Public read API namespace (optional-auth, redacted schema, rate limit) + tests.
  4. Public frontend pages (/p/...) + robots/sitemap.
  5. In-app /explore directory + search.

Steps 2-3 are the privacy-critical core and should be reviewed hardest.

9. Open questions

  • Caching: public pages are cacheable for SEO, but cache keys must not blur the redacted-vs-member rendering. Likely: cache only the anonymous projection at the edge; never cache member responses.
  • Do site_members trees appear in the sitemap for logged-in crawling? (Default: no — noindex.)
  • Per-tree opt-out of the directory even when public? (Probably unnecessary; unlisted already covers "reachable but not listed.")
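For the caching question, one possible shape of the "cache only the anonymous projection" idea is to choose the Cache-Control header from the caller's auth state. This is a sketch of an open design point, not a decision: the cache_headers function and the 300-second lifetime are assumptions.

```python
from typing import Dict, Optional


def cache_headers(user_id: Optional[int], max_age_seconds: int = 300) -> Dict[str, str]:
    """Allow shared caching only for anonymous responses."""
    if user_id is None:
        # The anonymous projection is identical for every anonymous caller.
        return {"Cache-Control": f"public, max-age={max_age_seconds}"}
    # Anything rendered for a logged-in caller must not be stored by shared caches.
    return {"Cache-Control": "private, no-store"}
```

Headers alone are not sufficient if an edge cache keys only on the URL: it would also need to bypass the cache for requests that carry a session cookie, so that a logged-in caller is never served, and never populates, the anonymous entry.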