How metalworks works — the whole engine, in detail

A complete internals reference for Lab2A/metalworks: what it is, the contract layer, the swappable-protocol architecture, the demand pipeline, the design pillar, and the surfaces. Grounded in the real modules under src/metalworks/. A capability still living in an open PR (not yet on main) is marked [planned] with its PR. For the why behind these choices, see Architecture; this page is the what, module by module.

0. What metalworks is, in one breath

Give it one sentence about a product idea. It reads real Reddit conversations (plus optional web research), tells you whether people actually want it, and turns that into the things you need to launch: positioning, the competitors to beat, a design system, a logo, a marketing site, a build spec, and launch copy. Every demand claim links back to a real comment you can click. Anything it cannot back with a real quote, it drops. Two identity rules run through everything:

Spec, don’t vendor. metalworks produces a runnable spec for your own coding agent (a Claude Code terminal) to build against. It is not a coding agent and does not host product backends.
Nothing invented. The honesty spine (exact-quote verification, distinct-author breadth, deterministic verdicts) is never bypassed. Evidence can always say no. The one deliberate exception is visual design (§7): a logo or a palette can’t be cited to a quote, so design is grounded directionally and labelled honestly, never faked.

It ships as a Python library, a CLI, an MCP server, and a Claude Code plugin. MIT, pre-release 0.0.x — the stable surface is the Metalworks facade, the metalworks.contract models, and the MCP tool contracts; everything else can change in any 0.x release.

1. The five-stage arc

   idea sentence
        │
        ▼
  ┌───────────┐   ┌───────────┐   ┌───────────┐   ┌──────────┐   ┌──────────┐
  │ RESEARCH  │──▶│  DESIGN    │──▶│  BUILD    │──▶│  LAUNCH  │──▶│  GROWTH  │
  │ demand    │   │ position/  │   │ BuildSpec │   │ launch   │   │ content/ │
  │ report    │   │ landscape/ │   │ (+ shape  │   │ assets,  │   │ SEO,     │
  │ (cited)   │   │ surface/UX │   │  match    │   │ reply    │   │ engage   │
  │           │   │ + design/  │   │  [planned │   │          │   │          │
  │           │   │ logo/site  │   │   #57])   │   │          │   │          │
  └───────────┘   └───────────┘   └───────────┘   └──────────┘   └──────────┘
        │              │               │
        └── each stage emits one FROZEN, TYPED bundle; downstream resolves
            EvidenceRefs against the upstream report's evidence list.

Stage 1 (Research) is the durable artifact; the later pillars are exposed as methods on the facade and as optional fields on the Research bundle. The bundle (contract/bundle.py) is the stage-1 artifact: demand plus optional competitors, positioning, landscape, assessment, ideation. The DESIGN stage is fully shipped (positioning, landscape, surface/UX, the visual design system, the logo mark, the styled marketing site, and a rendered-page design review — §7). Shape-matching in BUILD is [planned — PR #57].

2. The contract layer — the stable public API

metalworks.contract is the single source of truth for every surface (library, CLI, MCP, generated TypeScript). The models are Pydantic, with content-addressed evidence ids. The TypeScript twins (ts/contract.ts) and JSON-schema snapshots (contract/schema/) are generated by scripts/gen_ts_types.py. The regen / drift check is run by /pr-ready before a PR. CI does not gate gen_ts_types --check (it runs ruff / ruff-format / pyright / pytest), so contract drift is a manual gate, not a CI one.
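
To make "content-addressed" concrete, here is a minimal sketch, assuming an id is a hash over the fields that define a piece of evidence. The `evidence_id` helper, its choice of fields, and the `ev_` prefix are hypothetical illustrations, not the scheme metalworks itself uses.

```python
import hashlib
import json


def evidence_id(kind: str, source_url: str, text: str) -> str:
    """Derive a stable id from evidence content (hypothetical scheme)."""
    # Canonical JSON (sorted keys, fixed separators) so equal content hashes equally.
    payload = json.dumps(
        {"kind": kind, "source_url": source_url, "text": text},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"ev_{digest[:16]}"


a = evidence_id("quote", "https://example.com/c/1", "I would pay for this")
b = evidence_id("quote", "https://example.com/c/1", "I would pay for this")
c = evidence_id("quote", "https://example.com/c/1", "I would pay for that")
```

Because the id depends only on content, the same evidence reached through two pillars collapses to one record, while any change to the text yields a different id.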

The demand report (`contract/research.py`)

DemandReport is the canonical output. Key parts:

ranked_clusters: list[InsightCluster] — ranked consumer-insight themes. Each cluster:
- claim — one-line synthesized insight
- demand_score — weights distinct-author breadth above single-post virality
- distinct_author_count — the honest base rate (separate from mention_count)
- signal: SignalStrength (LOW / MEDIUM / HIGH) — the confidence chip
- quotes: list[ResolvedCitation] — verified quotes; no-quote-no-theme
ResolvedCitation — the portable, verified quote: verbatim text (exact-matched to a real comment), source_url (the permalink), author_hash (salted, for distinct-author counting, never the raw username), engagement.
Fork selectors: segments / candidate_wedges (options the engine surfaces, not collapses) with default_* / active_* accessors.
web_findings, price_finding, audience_profile, market_sizing, source_map, corpus_stats, cross_references, must_address_resolution.
evidence (computed property) — the flat, de-duplicated EvidenceRecord list every downstream EvidenceRef resolves against.
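
The relationship between demand_score, distinct_author_count, and mention_count can be sketched as follows. The formula, the `repeat_weight` parameter, and the `ClusterStats` type are assumptions chosen for illustration; the source only states that breadth is weighted above single-post virality.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterStats:
    """Hypothetical stand-in for the counting fields of a cluster."""
    claim: str
    distinct_author_count: int
    mention_count: int


def demand_score(stats: ClusterStats, repeat_weight: float = 0.1) -> float:
    """Assumed formula: linear in distinct authors, log-damped in raw mentions."""
    return stats.distinct_author_count + repeat_weight * math.log1p(stats.mention_count)


broad = ClusterStats("many people ask for CSV export", distinct_author_count=50, mention_count=50)
viral = ClusterStats("one person keeps asking for dark mode", distinct_author_count=1, mention_count=200)

# Highest score first: the broad theme should lead.
ranked = sorted([viral, broad], key=demand_score, reverse=True)
```

Damping repeat mentions logarithmically is one way to keep a single loud author from dominating the ranking; other weightings would satisfy the same rule.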

The evidence spine (`contract/evidence.py`)

EvidenceRef (evidence_id + a kind from a fixed set + optional cluster_rank) is how every downstream pillar points at upstream evidence: by id, never by free text. The no-cite-no-claim gate: a claim-bearing field with zero resolvable refs is dropped at assembly.
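
A minimal sketch of the no-cite-no-claim gate, using plain dataclasses instead of the real Pydantic models. The `Claim` type and the `assemble` function are hypothetical names, and the `EvidenceRef` here is a simplified stand-in that omits cluster_rank.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceRef:
    """Simplified stand-in: points at upstream evidence by id."""
    evidence_id: str
    kind: str


@dataclass(frozen=True)
class Claim:
    """Hypothetical claim-bearing field with its supporting refs."""
    text: str
    refs: tuple[EvidenceRef, ...] = ()


def assemble(claims: list[Claim], known_ids: set[str]) -> list[Claim]:
    """Keep only claims with at least one ref that resolves upstream."""
    kept: list[Claim] = []
    for claim in claims:
        resolvable = tuple(r for r in claim.refs if r.evidence_id in known_ids)
        if resolvable:  # zero resolvable refs means the claim is dropped
            kept.append(Claim(claim.text, resolvable))
    return kept


upstream_ids = {"ev_1", "ev_2"}
claims = [
    Claim("Users want CSV export", (EvidenceRef("ev_1", "quote"),)),
    Claim("Users want a mobile app", (EvidenceRef("ev_404", "quote"),)),
    Claim("Users want SSO"),
]
kept = assemble(claims, upstream_ids)
```

Only the first claim survives: the second cites an id that does not resolve, and the third cites nothing.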

Downstream pillar contracts

| Contract | File | What it carries |
| --- | --- | --- |
| `PositioningBrief` | `positioning.py` | Dunford wedge (competitive alt → unique attribute → value → beachhead → category) + price hypothesis; `wedge` is `None` when there’s no real white space |
| `Landscape` / `CompetitorMap` | `landscape.py` | competitors (direct/adjacent/status-quo), gaps, existing solutions |
| `SurfaceRecommendation` | `surface.py` | `chosen: SurfaceKind` + UX skeleton |
| `DesignSystem` | `design.py` | aesthetic + SAFE/RISK `DesignChoice` per dimension + directional `LandscapeSignal`s + `grounding_tier` + `DESIGN.md` (§7) |
| `DesignReview` / `StyleFinding` | `design.py` | a deterministic audit of a rendered page’s computed styles vs the system (§7) |
| `LogoSet` / `LogoOption` | `logo.py` | authored SVG logo options, drawn under the `DesignSystem` (§7) |
| `BuildSpec` | `build.py` | `features` (each evidence-backed, cite-or-die), `personas`, `pricing_tiers`, `stack` hint |
| `Assessment` / `Decision` | `assess.py` | GO / PIVOT / NO_GO, deterministic from demand × landscape; the LLM only writes the rationale |
| `MarketingSite` | `contract/site.py` (rendered by `research/site.py`) | verbatim-cited site sections + `render_site_html` |
| `ContentPlan` | `marketing.py` | deterministic SEO/content plan, one page per cluster |

The deterministic verdict is the heart of the honesty model: assess() computes the gap (demand strength vs landscape saturation); a partial landscape can never yield a hard GO.
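
The shape of such a verdict function can be sketched as below. The numeric thresholds, the [0, 1] input scales, the `assess_sketch` name, and the choice to downgrade a partial landscape to PIVOT are all assumptions; the source states only that the verdict is deterministic, derives from the demand/saturation gap, and never returns a hard GO on a partial landscape.

```python
from enum import Enum


class Decision(str, Enum):
    """Stand-in for the contract's decision values."""
    GO = "GO"
    PIVOT = "PIVOT"
    NO_GO = "NO_GO"


def assess_sketch(demand: float, saturation: float, landscape_complete: bool) -> Decision:
    """Pure function of its inputs; demand and saturation are assumed to lie in [0, 1]."""
    if demand < 0.3:  # hypothetical floor for "thin demand"
        return Decision.NO_GO
    gap = demand - saturation
    # A hard GO needs both a wide gap and a complete landscape.
    if gap >= 0.3 and landscape_complete:
        return Decision.GO
    return Decision.PIVOT


strong = assess_sketch(0.9, 0.2, landscape_complete=True)
partial = assess_sketch(0.9, 0.2, landscape_complete=False)
thin = assess_sketch(0.1, 0.0, landscape_complete=True)
```

Because no model call sits on the decision path, the same inputs always produce the same verdict, which is what makes the function unit-testable.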

3. The swappable-protocol architecture

Every layer is a runtime_checkable Protocol with thin adapters and a deterministic fake. Bare import metalworks pulls zero provider SDKs (or playwright); each adapter lazy-imports its SDK and raises MissingExtraError with the exact pip install to run.
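
The pattern described above (a runtime-checkable Protocol, a deterministic fake, and an adapter that imports its SDK lazily) can be sketched in a few lines. This is an illustration under assumptions: the one-argument `complete_text` signature, the `LazyAdapter` class, and the error message format are hypothetical, and `MissingExtraError` here is a local stand-in for the real class in errors.py.

```python
import importlib
from typing import Protocol, runtime_checkable


class MissingExtraError(ImportError):
    """Local stand-in: tells the user which extra to install."""

    def __init__(self, extra: str) -> None:
        super().__init__(f'missing optional dependency; run: pip install "metalworks[{extra}]"')
        self.extra = extra


@runtime_checkable
class ChatModel(Protocol):
    def complete_text(self, prompt: str) -> str: ...


class FakeChatModel:
    """Deterministic fake: no network, no SDK."""

    def complete_text(self, prompt: str) -> str:
        return f"fake:{prompt}"


class LazyAdapter:
    """Hypothetical adapter that imports its SDK only on first use."""

    def __init__(self, sdk_module: str, extra: str) -> None:
        self._sdk_module = sdk_module
        self._extra = extra

    def complete_text(self, prompt: str) -> str:
        try:
            importlib.import_module(self._sdk_module)
        except ImportError as exc:
            raise MissingExtraError(self._extra) from exc
        raise NotImplementedError("real SDK call omitted from this sketch")
```

Constructing `LazyAdapter` never touches the SDK, so importing the package stays cheap; the failure surfaces only when a call actually needs the provider.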

        Metalworks facade  /  CLI  /  MCP  /  plugin
                         │
   ┌────────┬──────────┬─┴────────┬──────────┬──────────┬───────────┐
   ▼        ▼          ▼          ▼          ▼          ▼           ▼
ChatModel  Embedding  Search    PageRenderer ItemSource Stores  (ShapeMatcher
 (llm/)    Provider   Provider   (render/)   + Corpus   (memory, [planned #57])
           (embeddings)(search/)  playwright  Reader     sqlite,
  anthropic fastembed exa         firecrawl  (research/  file)
  openai    openai    tavily      (+fake)    sources/)
  google    google    parallel               arctic, hackernews,
  (+fallback)         firecrawl              producthunt, web

| Protocol | Module | Methods | Adapters |
| --- | --- | --- | --- |
| `ChatModel` / `GroundedChatModel` | `llm/protocol.py` | `complete_text`, `complete_structured`, `complete_grounded` | anthropic, openai, google (+ `FallbackChatModel`, `FakeChatModel`) |
| `EmbeddingProvider` | `embeddings/` | `embed(texts, task)` + `IndexIdentity` guard | fastembed (local), openai, google (+ `FakeEmbedding`) |
| `SearchProvider` | `search/` | `search(query, max_results, recency_days)` | exa, tavily, parallel, firecrawl |
| `PageRenderer` | `render/` | `render(url)`, `extract_computed_styles(url, selectors)` + `capabilities` | playwright (owned Chromium, `[browser]`), firecrawl (hosted, screenshot-only), `FakeRenderer` |
| `ItemSource` / `CorpusReader` / `CommentSource` | `research/sources/`, `research/deps.py` | `pull`, `comments_for`, `latest_window` | arctic (Reddit), hackernews, producthunt, web |
| repos (`BriefRepo`, `RunRepo`, `CorpusRepo`, `AccountRepo`, `OpportunityRepo`, `InboxRepo`, `ArtifactStore`) | `stores/` | typed per-repo methods | memory, sqlite, filestore |

PageRenderer is infrastructure like SearchProvider: it is resolved by config.resolve_renderer() (Playwright → Firecrawl → None) and surfaced via doctor, with no skill/MCP tool of its own; the design pillar (§7) is its first consumer. The protocol exposes no caller-supplied JavaScript; style extraction runs a fixed, vendored script.

The SOURCES registry (research/sources/__init__.py, self-registering on import, lazy builtin loading) is the recurring extensibility pattern. metalworks.testing ships conformance suites (check_all_repos, check_item_source, check_page_renderer) so anyone writing a custom adapter can verify it.

Provider auto-resolution (config.py) maps ambient env keys to adapter instances. Precedence is explicit arg > env var > config file. Config files hold only non-secrets; all keys come from env.
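
The precedence rule (explicit arg > env var > config file) can be sketched as a single lookup function. The `resolve_setting` name, the `METALWORKS_` variable prefix, and the `chat_provider` setting are hypothetical; only the ordering comes from the text above.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    name: str,
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    config: Optional[Mapping[str, str]] = None,
    env_prefix: str = "METALWORKS_",  # assumed prefix, for illustration only
) -> Optional[str]:
    """Return the first value found, in precedence order."""
    env = os.environ if env is None else env
    if explicit is not None:  # 1. explicit argument wins
        return explicit
    env_value = env.get(f"{env_prefix}{name.upper()}")
    if env_value:  # 2. then the environment
        return env_value
    if config:  # 3. then the (non-secret) config file
        return config.get(name)
    return None


env = {"METALWORKS_CHAT_PROVIDER": "google"}
config = {"chat_provider": "anthropic"}
from_arg = resolve_setting("chat_provider", explicit="openai", env=env, config=config)
from_env = resolve_setting("chat_provider", env=env, config=config)
from_file = resolve_setting("chat_provider", env={}, config=config)
```

Passing `env` as a plain mapping keeps the function testable without mutating the process environment.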

4. The demand pipeline, step by step

question + subreddits
      │
      ▼
[1] plan brief        brief_from_question (D1-D8) + pick_target_subreddits   (LLM)
      │
      ▼
[2] pull corpus       ArcticReader: HF open-index/arctic Parquet via DuckDB (submissions)
      │
      ▼
[3] triage            embed + 3-bucket (accept / classify / reject) by cosine+BM25 hybrid
      │
      ▼
[4] hydrate           ArcticShiftApiClient: live comment trees for the relevant subset
      │
      ├───────────────┐
      ▼               ▼
[5] synthesize     [5'] web research (parallel)   GroundedChatModel or SearchProvider
   cluster + rank      structured WebFindings
      │               │
      └──────┬────────┘
             ▼
[6] triangulate    cross-stream agreement (agree / silent_web / silent_corpus / disagree)
             │      + QUOTE VERIFICATION: every quote exact-matched to a stored comment,
             │      or it is dropped
             ▼
        DemandReport  (ranked_clusters, each cited; partial+caveat on graceful failure)
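
Step [3] in the diagram above can be sketched as a blended score cut into three bands. Everything numeric here is an assumption: the `alpha` blend weight, the two thresholds, and the premise that the BM25 score arrives already normalized to [0, 1]. The middle band is the one left for a further classification step.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_score(cos_sim: float, lexical: float, alpha: float = 0.7) -> float:
    """Blend semantic and lexical relevance (lexical assumed normalized to [0, 1])."""
    return alpha * cos_sim + (1 - alpha) * lexical


def bucket(score: float, accept_at: float = 0.6, reject_below: float = 0.3) -> str:
    """Three-way split: confident accept, confident reject, or send to classify."""
    if score >= accept_at:
        return "accept"
    if score < reject_below:
        return "reject"
    return "classify"


query = [1.0, 0.0]
on_topic = bucket(hybrid_score(cosine(query, [0.9, 0.1]), lexical=0.8))
off_topic = bucket(hybrid_score(cosine(query, [0.0, 1.0]), lexical=0.1))
unsure = bucket(hybrid_score(cosine(query, [1.0, 1.0]), lexical=0.0))
```

Splitting into three bands keeps the more expensive classification step for the ambiguous middle instead of spending it on every submission.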

Orchestration lives in research/pipeline.py (run_research); dependencies are injected via ResearchDeps (chat, fast_chat, embeddings, corpus, reader, search, comments, sources). Web research is best-effort (a failure yields a partial report with a caveat); synthesis is required. run_discovery (discovery/service.py) is the sibling loop for Reddit engagement opportunities (filter → draft → gate), distinct from wedge validation.

Corpus sources. Submissions come from the Hugging Face open-index/arctic Parquet mirror (read with DuckDB; HF_TOKEN for long windows) or a Supabase Storage mirror (ARCTIC_SHIFT_SOURCE=mirror). Comments come from the live Arctic Shift API. Additional sources (Hacker News, Product Hunt, web) plug in through ItemSource.
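
The quote-verification rule from step [6] is simple enough to sketch directly. This version assumes "exact-matched" means the quote appears verbatim as a substring of a stored comment body; the `StoredComment` type and `verify_quotes` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredComment:
    """Hypothetical stored comment: where it lives and what it says."""
    permalink: str
    body: str


def verify_quotes(quotes: list[str], comments: list[StoredComment]) -> list[tuple[str, str]]:
    """Return (quote, permalink) pairs for quotes found verbatim; drop the rest."""
    verified: list[tuple[str, str]] = []
    for quote in quotes:
        # No fuzzy matching: the quote must appear character-for-character.
        match = next((c for c in comments if quote and quote in c.body), None)
        if match is not None:
            verified.append((quote, match.permalink))
    return verified


comments = [StoredComment("https://example.com/c/1", "Honestly I would pay for this today.")]
quotes = ["I would pay for this", "I would pay double for this"]
verified = verify_quotes(quotes, comments)
```

The second quote is close to the real comment but not verbatim, so it is dropped rather than repaired.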

5. Surfaces

Library facade (client.py, Metalworks): .research(), .positioning(), .landscape(), .assess(), .ideate(), .validate(), .surface(), .ux(), .site(), .render_site(site, research, design=…), .design(), .render_design_preview(), .logo(), .render_logo_picker(), .design_review(), .build_spec(), .scaffold(), .launch(), .channel_plan(), .content_plan(), plus .reddit and .discovery namespaces and a .deps escape hatch. .research() returns the Research bundle; sub-pillars are pure functions over it.
CLI (cli/): metalworks with sub-apps research (the pillars: site, design, logo, design-review, launch, …), reddit, arctic, config, models, sources, corpus, browser (browser install), mcp serve, plus a top-level interactive menu, doctor, and a render debug command. Lazy-imports providers so the CLI starts free of heavy deps.
MCP server (mcp/): tiered tools — Tier 1 zero-key (compliance lint, Reddit search, Arctic pulls, subreddit intel), Tier 2 key-gated (research + all pillar builders incl. design_from_report / logo_generate / design_review, ideate, assess, validate, discovery), Tier 3 gated + confirmed (Reddit posting requires a compliance pass + HMAC token + METALWORKS_ALLOW_POSTING=1). Each tool body is a plain function in mcp/tools.py; mcp/server.py registers a thin async wrapper per the _TOOL_WRAPPERS tuple.
Claude Code plugin (plugin/): 18 skills over the MCP tools — five engagement (/demand-report, /find-threads, /draft-reply, /subreddit-intel, /discovery) plus the grounded pillars (/position-wedge, /market-landscape, /surface-and-ux, /generate-site, /design, /logo, /design-review, /launch-kit, /content-plan, /build-spec, /go-no-go, /ideate, /validate).

Reddit engagement (reddit/) is its own subsystem: OAuth + encrypted tokens, public search, subreddit intel, inbox, and gated posting. The compliance gate is deterministic (heuristic_check) with an escalating LLM judge for uncertain cases — authentic, disclosed engagement only.
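
The three-part gate on posting (compliance pass + HMAC token + METALWORKS_ALLOW_POSTING=1) can be sketched with the standard library. How the real token is derived is not stated in this page, so binding it to the draft text is an assumption, and `issue_confirm_token` / `may_post` are hypothetical names.

```python
import hashlib
import hmac
import os
from typing import Mapping, Optional


def issue_confirm_token(secret: bytes, draft: str) -> str:
    """Assumed scheme: an HMAC over the exact draft the user confirmed."""
    return hmac.new(secret, draft.encode("utf-8"), hashlib.sha256).hexdigest()


def may_post(
    draft: str,
    token: str,
    secret: bytes,
    compliance_passed: bool,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """All three gates must pass; any failure blocks the post."""
    env = os.environ if env is None else env
    if not compliance_passed:
        return False
    if env.get("METALWORKS_ALLOW_POSTING") != "1":
        return False
    expected = issue_confirm_token(secret, draft)
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(expected, token)


secret = b"demo-secret"
token = issue_confirm_token(secret, "hello thread")
allowed_env = {"METALWORKS_ALLOW_POSTING": "1"}
```

Binding the token to the draft means an edit made after confirmation invalidates it, so the text that gets posted is the text that was approved.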

6. The honesty + safety model (why you can trust the output)

Quote verification: every ResolvedCitation.text is exact-matched to a stored comment; unmatched quotes are dropped. Every quote in a report that reaches the contract therefore traces to a real, stored comment.
Breadth over virality: ranking weights distinct authors, so 50 people each saying it once outranks 1 person saying it 200 times.
Honest nulls: no white-space wedge → positioning.wedge is None; thin demand → NO_GO; grounding unavailable → partial + caveat, never a fabricated GO.
Deterministic decisions: verdicts, demand bands, and the design review are pure, testable functions; the LLM writes only human-facing rationale, never the decision.
Embedding-identity guard: vectors carry IndexIdentity; a model/dim mismatch is a hard error, never a silent degrade.
Posting / charging / prod are gated: deterministic compliance + confirm token + opt-in env flag — the same pattern reused by the deploy/billing capability (§9, [planned]).
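
The embedding-identity guard from the list above can be sketched as an equality check that fails loudly. The two-field `IndexIdentity` and the `IndexIdentityMismatch` error are simplified stand-ins; the real class may carry more fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexIdentity:
    """Simplified stand-in: which model produced the vectors, and their width."""
    model: str
    dim: int


class IndexIdentityMismatch(ValueError):
    """Hypothetical error raised when vectors from different models would mix."""


def ensure_compatible(stored: IndexIdentity, incoming: IndexIdentity) -> None:
    """Hard error on any mismatch; never fall back to comparing unlike vectors."""
    if stored != incoming:
        raise IndexIdentityMismatch(
            f"index built with {stored.model} (dim={stored.dim}), "
            f"got {incoming.model} (dim={incoming.dim})"
        )


index_id = IndexIdentity("model-a", 384)
ensure_compatible(index_id, IndexIdentity("model-a", 384))  # same identity: passes
```

Cosine similarity between vectors from two different models is not meaningful, which is why a mismatch is treated as an error and not as a degraded result.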

7. The design pillar (visual)

Design is the one place metalworks lets the model produce output that no quote backs, because a logo or a palette can’t be cited to a Reddit quote. So design is the visual counterpart to positioning: positioning grounds the words, design grounds the look, directionally rather than by citation, and always labelled honestly.

The rendering primitive (`render/`)

The design pillar sits on the PageRenderer infrastructure (§3): an owned headless Chromium (metalworks[browser] → Playwright) that screenshots a page and reads its computed styles, with a hosted Firecrawl fallback (screenshot-only) and a FakeRenderer for tests. Chromium is a post-install step (metalworks browser install); doctor reports the active renderer tier without launching it.

The design system (`research/design.py` → `DesignSystem`)

build_design_system(deps, research) reads the competition at the richest tier available — a real renderer teardown of competitor sites (their actual fonts/colors) > web text > the model’s own knowledge — and authors a system under a constant house craft bar: an aesthetic direction, one SAFE/RISK DesignChoice per dimension (typography, color, layout, …), and directional LandscapeSignals. Two honesty signals: the SAFE/RISK stance, and the grounding_tier (renderer / web / model_knowledge) so the look is never overstated. Grounding is directional — there are no per-decision evidence_refs. Writes a DESIGN.md.
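
The tier ordering (renderer > web > model knowledge) amounts to picking the richest source that is actually available. A minimal sketch, assuming two boolean availability flags; the `GroundingTier` enum and `pick_grounding_tier` function are hypothetical, with values taken from the tier names above.

```python
from enum import Enum


class GroundingTier(str, Enum):
    RENDERER = "renderer"
    WEB = "web"
    MODEL_KNOWLEDGE = "model_knowledge"


def pick_grounding_tier(has_renderer_teardown: bool, has_web_text: bool) -> GroundingTier:
    """Choose the richest tier available and report it honestly."""
    if has_renderer_teardown:
        return GroundingTier.RENDERER
    if has_web_text:
        return GroundingTier.WEB
    # Nothing external was read: say so instead of implying research happened.
    return GroundingTier.MODEL_KNOWLEDGE


tier = pick_grounding_tier(has_renderer_teardown=False, has_web_text=True)
```

Recording the tier alongside the output is what lets a reader tell a look based on real competitor styles from one based on the model’s recall.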

The logo submodule (`research/logo.py` → `LogoSet`)

build_logo_set(chat, system) draws diverse, company-grade SVG marks under the DesignSystem (its aesthetic / type / color), one per design angle (symbol, logotype, negative-space, reference, expressive). Offered, never auto-selected. An angle that returns no valid SVG — or an unsafe one (a <script> / event handler / <foreignObject>) — is dropped, never inlined.
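
The drop rule for unsafe SVG can be sketched with the standard-library XML parser. This is an illustrative filter covering only the three cases named above (script elements, event-handler attributes, foreignObject); `is_safe_svg` is a hypothetical name, and a production filter would also want a hardened parser and checks such as script-bearing URLs.

```python
import xml.etree.ElementTree as ET

FORBIDDEN_TAGS = {"script", "foreignobject"}


def local_name(tag: str) -> str:
    """Strip an XML namespace like '{http://...}svg' and lowercase the rest."""
    return tag.rsplit("}", 1)[-1].lower()


def is_safe_svg(markup: str) -> bool:
    """True only for well-formed SVG with no forbidden tags or on* attributes."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return False  # not valid XML: drop it
    if local_name(root.tag) != "svg":
        return False
    for element in root.iter():
        if local_name(element.tag) in FORBIDDEN_TAGS:
            return False
        # Event handlers are attributes such as onload / onclick.
        if any(local_name(attr).startswith("on") for attr in element.attrib):
            return False
    return True


NS = 'xmlns="http://www.w3.org/2000/svg"'
candidates = [
    f'<svg {NS}><circle r="4"/></svg>',
    f'<svg {NS}><script>alert(1)</script></svg>',
    f'<svg {NS} onload="alert(1)"/>',
]
safe = [svg for svg in candidates if is_safe_svg(svg)]
```

Filtering by rejection keeps the rule simple: an option either passes untouched or is not offered at all.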

The styled site + the review

render_site_html(site, report, system) inlines a brand stylesheet from the DesignSystem (fonts, accent, light/dark) — strictly additive: with no system, the site is the unstyled structural HTML as before.
review_design(renderer, url, system=…) (research/design_review.py) is the QA half: a deterministic audit of a rendered page’s computed styles (fonts, heading scale, colors) against design hard-rules (too many fonts, a convergence-trap body face, a non-monotonic heading scale) and, with a system, whether the page matches it. The model writes nothing; needs a script-capable renderer (Playwright).
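
Two of the hard-rules named above (too many fonts, a non-monotonic heading scale) can be sketched as a pure function over already-extracted styles. The `review_styles` name, the input shapes, and the `max_fonts` limit of 3 are assumptions for illustration.

```python
def review_styles(
    font_families: list[str],
    heading_sizes_px: dict[str, float],
    max_fonts: int = 3,  # hypothetical limit
) -> list[str]:
    """Return human-readable findings; an empty list means no rule fired."""
    findings: list[str] = []

    # Rule 1: count distinct font families, ignoring case and stray whitespace.
    distinct = {family.strip().lower() for family in font_families}
    if len(distinct) > max_fonts:
        findings.append(f"too many fonts: {len(distinct)} > {max_fonts}")

    # Rule 2: sizes must not grow as the heading level deepens (h1 >= h2 >= ...).
    levels = ("h1", "h2", "h3", "h4", "h5", "h6")
    ordered = [heading_sizes_px[h] for h in levels if h in heading_sizes_px]
    if any(upper < lower for upper, lower in zip(ordered, ordered[1:])):
        findings.append("non-monotonic heading scale")

    return findings


clean = review_styles(["Inter", "inter ", "Georgia"], {"h1": 40, "h2": 28, "h3": 20})
messy = review_styles(["A", "B", "C", "D"], {"h1": 24, "h2": 32})
```

Since the function takes measured values and returns findings with no model in the loop, the same page always produces the same audit.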

Reachable on all four surfaces: mw.design() / mw.logo() / mw.design_review(), the metalworks research design / logo / design-review commands, the design_from_report / logo_generate / design_review MCP tools, and the /design / /logo / /design-review skills.

8. The startup-shapes catalog [planned — PR #57]

Not yet on main. Lives in the open feat/startup-shapes PR; this section describes what that PR adds, marked planned per the preamble.

Turns “build a product from the demand” from bespoke into reusable. A shape is a reference architecture for a class of product, in two layers:

LAYER 1 — 6 BASE STACKS (the reusable backend a CC terminal builds from)
  store · match · synthesize · automate · generate · watch

LAYER 2 — COMPOSABLE MODULES (payments · feed · threads · progress · paywall)

= ~25 NAMED PRODUCT SHAPES (base + modules + a thin domain skin)
  e.g. submission-portal (store), goods-marketplace (match),
       demand-intelligence (synthesize — this is Clique), price-monitor (watch)

A ShapeMatcher.match(research, *, surface=None, build_spec=None, min_score=0.5) ranks registered ProductShapes against a report — pure, read-only, verdict-reactive (NO_GO → no match; PIVOT → the pivot fork’s clusters). Scoring is embedding-similarity with a deterministic keyword fallback, each match cited to the clusters that drove it. Each base carries a scaffold_target pointer a Claude Code terminal resolves to a starter (spec, don’t vendor). See notes/57-shapes-and-plan-build.md for the planned plan-build orchestration and the open module-layer questions.

9. End-to-end: idea → live, paid product

"is there demand for X?"
   │  research            DemandReport (cited)
   ▼
 assess()                 GO / PIVOT / NO_GO        ── NO_GO ─▶ stop
   │ GO
   ▼
 design()                 DesignSystem + logo + a styled, cited marketing site
   │
   ▼
 build_spec / scaffold    an evidence-grounded build harness for a coding agent
   │                      (+ shape match  [planned — PR #57])
   ▼
 deploy + bill            metalworks deploy (Vercel) + metalworks billing (Stripe)  [planned — PR #51]
   │                      pure subscription-gate + webhook mapper, test-mode by default
   ▼
 launch reply             a disclosed, non-salesy Reddit reply to the originating thread

Shipped today: research → assess → the full design pillar (design system, logo, styled site, review) → build spec + scaffold → launch + content. [planned], in open PRs: shape match (#57) and deploy + billing (#51 — metalworks deploy to Vercel and metalworks billing create to Stripe, new DeployProvider / BillingProvider protocols mirroring the llm/search adapters, irreversible steps gated like Reddit posting).

10. The Clique relationship

Clique-Labs/metalworks is a separate, private Next.js factory that builds and hosts micro-SaaS products. It consumes this OSS engine as its demand brain (a pip dependency; the adapter maps the OSS arc → its WedgeSpec). The engine specs; the factory (and any Claude Code terminal) builds. Clique itself is the reference implementation of the synthesize shape (external-data ingestion → LLM synthesis → cited evidence → dashboard).

11. Packaging, distribution, testing

Install: pip install "metalworks[<provider>,research]" + one env key. Extras pull provider SDKs / DuckDB / redditwarp / supabase / mcp / playwright ([browser], a Chromium post-install step) behind [...]; core stays lean (pydantic, httpx, typer, rich).
Distribution: PyPI, the metalworks CLI, the MCP server, the Claude Code plugin.
Quality bars: pyright strict on src/, ruff, pytest run offline by default (--disable-socket; network tests gated -m network, real-browser tests -m browser). metalworks.testing ships conformance suites for custom adapters/backends. The contract drift check (gen_ts_types --check) is run by /pr-ready, not by CI.
Honesty in tests: the demand pipeline’s guarantees (exact-quote match, breadth weighting, deterministic verdict) and the deterministic design review are CI-tested, not aspirational.

Appendix — module map

src/metalworks/
  client.py            Metalworks facade + lazy dependency resolver
  config.py            provider auto-resolution (incl. resolve_renderer), non-secret config
  errors.py            MetalworksError, MissingExtraError, MissingKeyError, BrowserNotInstalledError, ...
  contract/            the stable Pydantic API (research, positioning, landscape, surface,
                       design, logo, build, assess, site, marketing, bundle, evidence, ...)
  render/              PageRenderer protocol + adapters (playwright/firecrawl) + FakeRenderer
  research/            pipeline.py, deps.py, planner/, exploration/, synthesis/, triangulate/,
                       sources/ (arctic, hackernews, producthunt, web), site.py, design.py,
                       design_review.py, logo.py
  discovery/           run_discovery (engagement opportunities)
  reddit/              oauth, search, subreddit intel, inbox, compliance, posting
  llm/                 ChatModel protocol + adapters (anthropic/openai/google) + fallback + fake
  embeddings/          EmbeddingProvider + adapters + FakeEmbedding + IndexIdentity guard
  search/              SearchProvider + adapters (exa/tavily/parallel/firecrawl)
  stores/              repo protocols + memory/sqlite/file backends + token crypto
  build/               BuildSpec assembly + scaffold harness
  cli/                 the metalworks CLI
  mcp/                 FastMCP server + tiered tools + jobs
  testing/             conformance suites + fakes
  project.py           .metalworks/ project detection + run persistence
  shapes/              startup-shape catalog + matcher  [planned — PR #57]
