How metalworks works — the whole engine, in detail
A complete internals reference for Lab2A/metalworks: what it is, the contract layer, the swappable-protocol architecture, the demand pipeline, the design pillar, and the surfaces. Grounded in the real modules under src/metalworks/. A capability still living in an open PR (not yet on main) is marked [planned] with its PR. For the why behind these choices, see Architecture; this page is the what, module by module.
0. What metalworks is, in one breath
Give it one sentence about a product idea. It reads real Reddit conversations (plus optional web research), tells you whether people actually want it, and turns that into the things you need to launch: positioning, the competitors to beat, a design system, a logo, a marketing site, a build spec, and launch copy. Every demand claim links back to a real comment you can click. Anything it cannot back with a real quote, it drops.

Two identity rules run through everything:

- Spec, don’t vendor. metalworks produces a runnable spec for your own coding agent (a Claude Code terminal) to build against. It is not a coding agent and does not host product backends.
- Nothing invented. The honesty spine (exact-quote verification, distinct-author breadth, deterministic verdicts) is never bypassed. Evidence can always say no. The one deliberate exception is visual design (§7): a logo or a palette can’t be cited to a quote, so design is grounded directionally and labelled honestly, never faked.
The stable public surface is the Metalworks facade, the metalworks.contract models, and
the MCP tool contracts; everything else can change in any 0.x release.
1. The five-stage arc
Stage 1 (Research) is the durable artifact; the later pillars are exposed as methods on the
facade and as optional fields on the Research bundle. The bundle (contract/bundle.py) is
the stage-1 artifact: demand plus optional competitors, positioning, landscape,
assessment, ideation. The DESIGN stage is fully shipped (positioning, landscape,
surface/UX, the visual design system, the logo mark, the styled marketing site,
and a rendered-page design review — §7). Shape-matching in BUILD is [planned — PR #57].
2. The contract layer — the stable public API
metalworks.contract is the single source of truth for every surface (library, CLI, MCP,
generated TypeScript). Pydantic models with content-addressed evidence ids. The TypeScript
twins (ts/contract.ts) and JSON-schema snapshots (contract/schema/) are generated by
scripts/gen_ts_types.py; the regen / drift check is run by /pr-ready before a PR —
note CI does not gate gen_ts_types --check (it runs ruff / ruff-format / pyright /
pytest), so contract drift is a manual gate, not a CI one.
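One way to read "content-addressed evidence ids" is that the id is derived from the evidence content itself, so the same quote at the same permalink always gets the same id. The sketch below assumes a SHA-256 over a canonical JSON encoding; the function name `evidence_id`, the `ev_` prefix, and the choice of fields are illustrative assumptions, not the library's actual scheme.

```python
import hashlib
import json


def evidence_id(kind: str, source_url: str, text: str) -> str:
    """Derive a stable id from the evidence content (hypothetical scheme)."""
    # Canonical encoding: sorted keys and fixed separators keep the bytes stable.
    payload = json.dumps(
        {"kind": kind, "source_url": source_url, "text": text},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"ev_{digest[:16]}"


# Same content gives the same id; a different permalink gives a different id.
a = evidence_id("quote", "https://example.com/c/1", "I would pay for this")
b = evidence_id("quote", "https://example.com/c/1", "I would pay for this")
c = evidence_id("quote", "https://example.com/c/2", "I would pay for this")
```

Because the id is a pure function of the content, two surfaces that see the same evidence can agree on its id without coordinating.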
The demand report (contract/research.py)
DemandReport is the canonical output. Key parts:
- ranked_clusters: list[InsightCluster]: ranked consumer-insight themes. Each cluster:
  - claim: one-line synthesized insight
  - demand_score: weights distinct-author breadth above single-post virality
  - distinct_author_count: the honest base rate (separate from mention_count)
  - signal: SignalStrength (LOW / MEDIUM / HIGH), the confidence chip
  - quotes: list[ResolvedCitation]: verified quotes; no-quote-no-theme
- ResolvedCitation: the portable, verified quote: verbatim text (exact-matched to a real comment), source_url (the permalink), author_hash (salted, for distinct-author counting, never the raw username), engagement.
- Fork selectors: segments / candidate_wedges (options the engine surfaces, not collapses) with default_* / active_* accessors.
- web_findings, price_finding, audience_profile, market_sizing, source_map, corpus_stats, cross_references, must_address_resolution.
- evidence (computed property): the flat, de-duplicated EvidenceRecord list every downstream EvidenceRef resolves against.
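The relationship between author_hash, distinct_author_count, and mention_count can be sketched as follows. This is a minimal illustration, assuming an HMAC-SHA256 keyed by the salt and case-insensitive usernames; the helper names and the salt value are hypothetical, and the real hashing scheme is not documented here.

```python
import hashlib
import hmac


def author_hash(username: str, salt: bytes) -> str:
    """Salted, one-way hash so the raw username never leaves the pipeline."""
    # HMAC-SHA256 keyed by the salt; lowercasing is an assumption of this sketch.
    return hmac.new(salt, username.lower().encode("utf-8"), hashlib.sha256).hexdigest()


def distinct_author_count(usernames: list[str], salt: bytes) -> int:
    """Count unique authors by hash, independent of how often each one posted."""
    return len({author_hash(name, salt) for name in usernames})


SALT = b"per-install-secret"  # hypothetical; a real salt would come from config
mentions = ["alice", "bob", "alice", "alice", "carol"]
mention_count = len(mentions)                      # 5 mentions
distinct = distinct_author_count(mentions, SALT)   # 3 distinct authors
```

Keeping the two counts separate is what lets ranking favour breadth: five mentions from three people is a weaker signal than five mentions from five.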
The evidence spine (contract/evidence.py)
EvidenceRef (evidence_id + kind + optional
cluster_rank) is how every downstream pillar points at upstream evidence by id, never by
free text. The no-cite-no-claim gate: a claim-bearing field with zero resolvable refs is
dropped at assembly.
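The gate described above can be sketched as a filter over claims. The `Ref` and `Claim` classes and the `drop_uncited` helper are stand-ins invented for this example (the real models live in metalworks.contract), and pruning the unresolvable refs from a surviving claim is an assumption of the sketch.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Ref:
    evidence_id: str


@dataclass
class Claim:
    text: str
    refs: list[Ref] = field(default_factory=list)


def drop_uncited(claims: list[Claim], known_ids: set[str]) -> list[Claim]:
    """Keep only claims with at least one ref that resolves to known evidence."""
    kept = []
    for claim in claims:
        resolvable = [r for r in claim.refs if r.evidence_id in known_ids]
        if resolvable:
            # Unresolvable refs are pruned; the claim survives on the rest.
            kept.append(Claim(claim.text, resolvable))
    return kept


known = {"ev_1", "ev_2"}
claims = [
    Claim("Users want offline mode", [Ref("ev_1"), Ref("ev_missing")]),
    Claim("Users want a mobile app", [Ref("ev_missing")]),
    Claim("Users want dark mode", []),
]
survivors = drop_uncited(claims, known)
```

Only the first claim survives: the second points at evidence that does not exist, and the third cites nothing.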
Downstream pillar contracts
| Contract | File | What it carries |
|---|---|---|
| PositioningBrief | positioning.py | Dunford wedge (competitive alt → unique attribute → value → beachhead → category) + price hypothesis; wedge is None when there’s no real white space |
| Landscape / CompetitorMap | landscape.py | competitors (direct/adjacent/status-quo), gaps, existing solutions |
| SurfaceRecommendation | surface.py | chosen: SurfaceKind + UX skeleton |
| DesignSystem | design.py | aesthetic + SAFE/RISK DesignChoice per dimension + directional LandscapeSignals + grounding_tier + DESIGN.md (§7) |
| DesignReview / StyleFinding | design.py | a deterministic audit of a rendered page’s computed styles vs the system (§7) |
| LogoSet / LogoOption | logo.py | authored SVG logo options, drawn under the DesignSystem (§7) |
| BuildSpec | build.py | features (each evidence-backed, cite-or-die), personas, pricing_tiers, stack hint |
| Assessment / Decision | assess.py | GO / PIVOT / NO_GO, deterministic from demand × landscape; LLM only writes the rationale |
| MarketingSite | contract/site.py (rendered by research/site.py) | verbatim-cited site sections + render_site_html |
| ContentPlan | marketing.py | deterministic SEO/content plan, one page per cluster |
assess() computes the gap
(demand strength vs landscape saturation); a partial landscape can never yield a hard GO.
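A deterministic verdict of this kind can be sketched as a pure function. Everything numeric here is an assumption made for illustration: the [0, 1] scales, the 0.3 demand floor, the 0.2 gap margin, and the choice to return PIVOT when the landscape is partial. Only the two invariants come from the text: thin demand is NO_GO, and a partial landscape never yields GO.

```python
from enum import Enum


class Decision(str, Enum):
    GO = "GO"
    PIVOT = "PIVOT"
    NO_GO = "NO_GO"


def decide(demand: float, saturation: float, landscape_partial: bool) -> Decision:
    """Pure verdict from demand strength and landscape saturation, both in [0, 1]."""
    if demand < 0.3:  # hypothetical floor for "thin demand"
        return Decision.NO_GO
    gap = demand - saturation
    if gap >= 0.2 and not landscape_partial:  # hypothetical GO margin
        return Decision.GO
    # Real demand, but the market is crowded or only partially known.
    return Decision.PIVOT


verdict_clear = decide(0.8, 0.2, landscape_partial=False)    # Decision.GO
verdict_partial = decide(0.8, 0.2, landscape_partial=True)   # Decision.PIVOT
```

Because the function takes no model output, the same inputs always produce the same verdict, which is what makes the decision testable.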
3. The swappable-protocol architecture
Every layer is a runtime_checkable Protocol with thin adapters and a deterministic fake.
Bare import metalworks pulls zero provider SDKs (or playwright); each adapter
lazy-imports its SDK and raises MissingExtraError with the exact pip install to run.
| Protocol | Module | Methods | Adapters |
|---|---|---|---|
| ChatModel / GroundedChatModel | llm/protocol.py | complete_text, complete_structured, complete_grounded | anthropic, openai, google (+ FallbackChatModel, FakeChatModel) |
| EmbeddingProvider | embeddings/ | embed(texts, task) + IndexIdentity guard | fastembed (local), openai, google (+ FakeEmbedding) |
| SearchProvider | search/ | search(query, max_results, recency_days) | exa, tavily, parallel, firecrawl |
| PageRenderer | render/ | render(url), extract_computed_styles(url, selectors) + capabilities | playwright (owned Chromium, [browser]), firecrawl (hosted, screenshot-only), FakeRenderer |
| ItemSource / CorpusReader / CommentSource | research/sources/, research/deps.py | pull, comments_for, latest_window | arctic (Reddit), hackernews, producthunt, web |
| repos (BriefRepo, RunRepo, CorpusRepo, AccountRepo, OpportunityRepo, InboxRepo, ArtifactStore) | stores/ | typed per-repo methods | memory, sqlite, filestore |
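The pattern in the table (a runtime_checkable Protocol, a deterministic fake, and a lazy SDK import with an actionable error) can be sketched with the standard library alone. The `list[str]` return type, the `FakeSearch` class, the `load_sdk` helper, and the `exa` extra name are assumptions for this example; only the search(query, max_results, recency_days) shape and the MissingExtraError name come from the text above.

```python
import importlib
from typing import Protocol, runtime_checkable


class MissingExtraError(ImportError):
    """Raised when an adapter's optional SDK is not installed."""


@runtime_checkable
class SearchProvider(Protocol):
    def search(self, query: str, max_results: int, recency_days: int) -> list[str]: ...


class FakeSearch:
    """Deterministic fake: no network, same output for the same input."""

    def search(self, query: str, max_results: int, recency_days: int) -> list[str]:
        return [f"{query} result {i}" for i in range(max_results)]


def load_sdk(module: str, extra: str):
    """Import an SDK only when an adapter is built, with an actionable error."""
    try:
        return importlib.import_module(module)
    except ImportError as exc:
        raise MissingExtraError(f'pip install "metalworks[{extra}]"') from exc


fake = FakeSearch()

# A missing SDK surfaces as the exact install command, not a bare ImportError.
try:
    load_sdk("sdk_that_is_not_installed", "exa")
except MissingExtraError as err:
    hint = str(err)
```

Note that `FakeSearch` never inherits from `SearchProvider`; structural typing is what keeps adapters thin and lets third parties add their own.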
PageRenderer is infrastructure like SearchProvider — resolved by config.resolve_renderer()
(Playwright → Firecrawl → None), surfaced via doctor, with no skill/MCP tool of its own;
the design pillar (§7) is its first consumer. The protocol exposes no caller-supplied
JavaScript — style extraction runs a fixed, vendored script.
The SOURCES registry (research/sources/__init__.py, self-registering on import, lazy
builtin loading) is the recurring extensibility pattern. metalworks.testing ships
conformance suites (check_all_repos, check_item_source, check_page_renderer) so anyone
writing a custom adapter can verify it.
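A self-registering registry of the kind described can be sketched in a few lines. The `register_source` decorator, the `get_source` helper, and the `pull(query)` signature are hypothetical names chosen for this example; the real registry in research/sources/__init__.py may be shaped differently.

```python
from typing import Callable

SOURCES: dict[str, Callable[[], object]] = {}


def register_source(name: str):
    """Decorator: importing the module is enough to make the source available."""
    def decorator(factory):
        if name in SOURCES:
            raise ValueError(f"source {name!r} is already registered")
        SOURCES[name] = factory
        return factory
    return decorator


@register_source("hackernews")
class HackerNewsSource:
    def pull(self, query: str) -> list[str]:
        return []  # a real adapter would fetch items here


def get_source(name: str):
    """Instantiate a registered source by name."""
    try:
        return SOURCES[name]()
    except KeyError:
        raise KeyError(f"unknown source {name!r}; known: {sorted(SOURCES)}") from None


source = get_source("hackernews")
```

The design choice is that adding a source touches one file: the adapter registers itself, and nothing central needs editing.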
Provider auto-resolution (config.py): ambient env keys → adapter instances. Precedence is
explicit arg > env var > config file. Config files hold only non-secrets; all keys come from
env.
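The stated precedence (explicit arg > env var > config file) reduces to a short resolver. This is a sketch under assumptions: the `resolve_setting` name and the `METALWORKS_<NAME>` environment naming are hypothetical, and the real config.py may resolve keys differently.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    name: str,
    explicit: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
    file_config: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Resolve one setting: explicit arg > env var > config file."""
    if explicit is not None:
        return explicit
    env_key = f"METALWORKS_{name.upper()}"  # hypothetical naming convention
    if env_key in env:
        return env[env_key]
    return (file_config or {}).get(name)


env = {"METALWORKS_CHAT_MODEL": "from-env"}
cfg = {"chat_model": "from-file", "search": "exa"}

from_arg = resolve_setting("chat_model", "explicit", env, cfg)
from_env = resolve_setting("chat_model", None, env, cfg)
from_file = resolve_setting("search", None, env, cfg)
```

Passing `env` and `file_config` as arguments keeps the resolver testable without touching the real process environment.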
4. The demand pipeline, step by step
The pipeline lives in research/pipeline.py (run_research); dependencies are injected via
ResearchDeps (chat, fast_chat, embeddings, corpus, reader, search, comments, sources). Web
research is best-effort (a failure yields a partial report with a caveat); synthesis is
required. run_discovery (discovery/service.py) is the sibling loop for Reddit engagement
opportunities (filter → draft → gate), distinct from wedge validation.
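The split between a required step and a best-effort step can be sketched as follows. The `Report` dataclass, the `run` function, and the caveat wording are invented for this example and do not mirror the real run_research signature; the point is only that a synthesis error propagates while a web-research error becomes a caveat on a partial report.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Report:
    clusters: list[str]
    web_findings: list[str] = field(default_factory=list)
    partial: bool = False
    caveats: list[str] = field(default_factory=list)


def run(idea: str, synthesize: Callable[[str], list[str]],
        web_search: Callable[[str], list[str]]) -> Report:
    """Synthesis is required; web research degrades to a caveat on failure."""
    report = Report(clusters=synthesize(idea))  # any error here propagates
    try:
        report.web_findings = web_search(idea)
    except Exception as exc:  # best-effort step: record the failure, keep going
        report.partial = True
        report.caveats.append(f"web research unavailable: {exc}")
    return report


def failing_search(idea: str) -> list[str]:
    raise TimeoutError("search timed out")


report = run("invoice tool", lambda idea: ["late payments"], failing_search)
```

The caller still gets the demand clusters, plus an explicit marker that part of the picture is missing.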
Corpus sources. Submissions come from the Hugging Face open-index/arctic Parquet mirror
(read with DuckDB; HF_TOKEN for long windows) or a Supabase Storage mirror
(ARCTIC_SHIFT_SOURCE=mirror). Comments come from the live Arctic Shift API. Additional
sources (Hacker News, Product Hunt, web) plug in through ItemSource.
5. Surfaces
- Library facade (client.py, Metalworks): .research(), .positioning(), .landscape(), .assess(), .ideate(), .validate(), .surface(), .ux(), .site(), .render_site(site, research, design=…), .design(), .render_design_preview(), .logo(), .render_logo_picker(), .design_review(), .build_spec(), .scaffold(), .launch(), .channel_plan(), .content_plan(), plus .reddit and .discovery namespaces and a .deps escape hatch. .research() returns the Research bundle; sub-pillars are pure functions over it.
- CLI (cli/): metalworks with sub-apps research (the pillars: site, design, logo, design-review, launch, …), reddit, arctic, config, models, sources, corpus, browser (browser install), mcp serve, plus a top-level interactive menu, doctor, and a render debug command. Lazy-imports providers so the CLI starts free of heavy deps.
- MCP server (mcp/): tiered tools. Tier 1 zero-key (compliance lint, Reddit search, Arctic pulls, subreddit intel), Tier 2 key-gated (research + all pillar builders incl. design_from_report / logo_generate / design_review, ideate, assess, validate, discovery), Tier 3 gated + confirmed (Reddit posting requires a compliance pass + HMAC token + METALWORKS_ALLOW_POSTING=1). Each tool body is a plain function in mcp/tools.py; mcp/server.py registers a thin async wrapper per the _TOOL_WRAPPERS tuple.
- Claude Code plugin (plugin/): 18 skills over the MCP tools: five engagement (/demand-report, /find-threads, /draft-reply, /subreddit-intel, /discovery) plus the grounded pillars (/position-wedge, /market-landscape, /surface-and-ux, /generate-site, /design, /logo, /design-review, /launch-kit, /content-plan, /build-spec, /go-no-go, /ideate, /validate).

The Reddit layer (reddit/) is its own subsystem: OAuth + encrypted tokens, public
search, subreddit intel, inbox, and gated posting. The compliance gate is deterministic
(heuristic_check) with an escalating LLM judge for uncertain cases — authentic, disclosed
engagement only.
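The three-part posting gate (compliance pass + HMAC token + opt-in env flag) can be sketched with the standard library. Binding the token to the exact draft text is an assumption of this example, as are the `confirm_token` and `may_post` names and the demo secret; only the three gates and the METALWORKS_ALLOW_POSTING=1 flag come from the text.

```python
import hashlib
import hmac
import os


def confirm_token(secret: bytes, draft: str) -> str:
    """Token bound to the exact draft text, so an edited draft needs a new token."""
    return hmac.new(secret, draft.encode("utf-8"), hashlib.sha256).hexdigest()


def may_post(draft: str, token: str, secret: bytes, compliance_ok: bool,
             env=os.environ) -> bool:
    """All three gates must pass; any single failure blocks the post."""
    if not compliance_ok:
        return False
    if env.get("METALWORKS_ALLOW_POSTING") != "1":
        return False
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(token, confirm_token(secret, draft))


SECRET = b"demo-secret"  # hypothetical; never hard-code a real secret
draft = "Disclosure: I build this. Happy to share what worked for us."
token = confirm_token(SECRET, draft)
allowed = may_post(draft, token, SECRET, True, {"METALWORKS_ALLOW_POSTING": "1"})
```

Under this assumption a token approved for one draft cannot be replayed against a different one, which keeps the human confirmation meaningful.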
6. The honesty + safety model (why you can trust the output)
- Quote verification: every ResolvedCitation.text is exact-matched to a stored comment; unmatched quotes are dropped. A report that reaches the contract is guaranteed real.
- Breadth over virality: ranking weights distinct authors, so 50 people each saying it once outranks 1 person saying it 200 times.
- Honest nulls: no white-space wedge → positioning.wedge is None; thin demand → NO_GO; grounding unavailable → partial + caveat, never a fabricated GO.
- Deterministic decisions: verdicts, demand bands, and the design review are pure, testable functions; the LLM writes only human-facing rationale, never the decision.
- Embedding-identity guard: vectors carry IndexIdentity; a model/dim mismatch is a hard error, never a silent degrade.
- Posting / charging / prod are gated: deterministic compliance + confirm token + opt-in env flag, the same pattern reused by the deploy/billing capability (§9, [planned]).
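The quote-verification rule in the list above can be sketched as a filter. This assumes "exact-matched" means the quote text appears verbatim as a substring of the stored comment body; the dictionary shapes and the `verify_quotes` name are illustrative, not the library's data model.

```python
def verify_quotes(quotes: list[dict], comments: dict[str, str]) -> list[dict]:
    """Keep a quote only if its text appears verbatim in the cited comment."""
    verified = []
    for quote in quotes:
        body = comments.get(quote["comment_id"])
        if body is not None and quote["text"] in body:
            verified.append(quote)
    return verified


comments = {
    "c1": "Honestly I would pay $20 a month for this.",
    "c2": "Not for me.",
}
quotes = [
    {"comment_id": "c1", "text": "I would pay $20 a month"},
    {"comment_id": "c1", "text": "I would pay $50 a month"},  # altered number
    {"comment_id": "c9", "text": "Not for me."},              # comment not stored
]
kept = verify_quotes(quotes, comments)
```

Only the first quote survives: the second was altered by one number, and the third cites a comment that is not in the corpus.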
7. The design pillar (visual)
This is the one place metalworks lets the model produce un-grounded output, because a logo or a palette can’t be cited to a Reddit quote. So design is the visual counterpart to positioning: positioning grounds the words, design grounds the look, directionally rather than by citation, and always labelled honestly.

The rendering primitive (render/)
The design pillar sits on the PageRenderer infrastructure (§3): an owned headless Chromium
(metalworks[browser] → Playwright) that screenshots a page and reads its computed
styles, with a hosted Firecrawl fallback (screenshot-only) and a FakeRenderer for tests.
Chromium is a post-install step (metalworks browser install); doctor reports the active
renderer tier without launching it.
The design system (research/design.py → DesignSystem)
build_design_system(deps, research) reads the competition at the richest tier available — a
real renderer teardown of competitor sites (their actual fonts/colors) > web text > the
model’s own knowledge — and authors a system under a constant house craft bar: an aesthetic
direction, one SAFE/RISK DesignChoice per dimension (typography, color, layout, …), and
directional LandscapeSignals. Two honesty signals: the SAFE/RISK stance, and the
grounding_tier (renderer / web / model_knowledge) so the look is never overstated.
Grounding is directional — there are no per-decision evidence_refs. Writes a DESIGN.md.
The logo submodule (research/logo.py → LogoSet)
build_logo_set(chat, system) draws diverse, company-grade SVG marks under the
DesignSystem (its aesthetic / type / color), one per design angle (symbol, logotype,
negative-space, reference, expressive). Offered, never auto-selected. An angle that returns
no valid SVG — or an unsafe one (a <script> / event handler / <foreignObject>) — is
dropped, never inlined.
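The safety rule for logo SVG (no `<script>`, no event handlers, no `<foreignObject>`) can be sketched as a parser-based check. This is a minimal illustration using the standard library's ElementTree; the `is_safe_svg` name is hypothetical, and a production check on untrusted markup would likely use a hardened parser such as defusedxml rather than the stock one.

```python
import xml.etree.ElementTree as ET

FORBIDDEN_TAGS = {"script", "foreignobject"}


def local_name(qualified: str) -> str:
    """Strip the "{namespace}" prefix ElementTree puts on qualified names."""
    return qualified.rsplit("}", 1)[-1].lower()


def is_safe_svg(markup: str) -> bool:
    """Reject markup that is not parseable SVG or that carries active content."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return False
    if local_name(root.tag) != "svg":
        return False
    for element in root.iter():
        if local_name(element.tag) in FORBIDDEN_TAGS:
            return False
        for attr in element.attrib:
            if local_name(attr).startswith("on"):
                return False  # onclick, onload, and other event handlers
    return True


SVG_NS = 'xmlns="http://www.w3.org/2000/svg"'
good = f'<svg {SVG_NS}><circle r="4"/></svg>'
bad_script = f"<svg {SVG_NS}><script>alert(1)</script></svg>"
bad_handler = f'<svg {SVG_NS}><circle r="4" onclick="x()"/></svg>'
```

Parsing before checking matters: a string search for "script" would miss namespaced or oddly cased tags, while walking the tree sees every element and attribute.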
The styled site + the review
- render_site_html(site, report, system) inlines a brand stylesheet from the DesignSystem (fonts, accent, light/dark). It is strictly additive: with no system, the site is the unstyled structural HTML as before.
- review_design(renderer, url, system=…) (research/design_review.py) is the QA half: a deterministic audit of a rendered page’s computed styles (fonts, heading scale, colors) against design hard-rules (too many fonts, a convergence-trap body face, a non-monotonic heading scale) and, with a system, whether the page matches it. The model writes nothing; it needs a script-capable renderer (Playwright).

The pillar is exposed as mw.design() / mw.logo() / mw.design_review(), the
metalworks research design / logo / design-review commands, the design_from_report /
logo_generate / design_review MCP tools, and the /design / /logo / /design-review
skills.
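Two of the hard rules named for the design review (too many fonts, a non-monotonic heading scale) can be sketched as a pure function over computed styles. The input dictionary shape, the `review_styles` name, and the limit of two font families are assumptions for this example; the real review_design reads styles through a renderer and returns StyleFinding objects.

```python
def review_styles(styles: dict, max_fonts: int = 2) -> list[str]:
    """Return findings for computed styles; an empty list means no rule fired."""
    findings = []
    fonts = set(styles["font_families"].values())
    if len(fonts) > max_fonts:  # hypothetical limit
        findings.append(f"too many fonts: {len(fonts)} > {max_fonts}")
    # Heading sizes in px, ordered h1, h2, h3; each must not exceed the previous.
    sizes = styles["heading_sizes_px"]
    for level, (bigger, smaller) in enumerate(zip(sizes, sizes[1:]), start=1):
        if smaller > bigger:
            findings.append(f"non-monotonic heading scale: h{level + 1} > h{level}")
    return findings


page = {
    "font_families": {"body": "Inter", "h1": "Fraunces", "button": "Roboto"},
    "heading_sizes_px": [40.0, 28.0, 32.0],
}
findings = review_styles(page)
```

No model is involved at any point, so the same page always produces the same findings.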
8. The startup-shapes catalog [planned — PR #57]
Not yet on main. Lives in the open feat/startup-shapes PR; this section describes what that PR adds, marked planned per the preamble.

Turns “build a product from the demand” from bespoke into reusable. A shape is a reference architecture for a class of product, in two layers.
ShapeMatcher.match(research, *, surface=None, build_spec=None, min_score=0.5) ranks
registered ProductShapes against a report — pure, read-only, verdict-reactive (NO_GO →
no match; PIVOT → the pivot fork’s clusters). Scoring is embedding-similarity with a
deterministic keyword fallback, each match cited to the clusters that drove it. Each base
carries a scaffold_target pointer a Claude Code terminal resolves to a starter (spec, don’t
vendor). See notes/57-shapes-and-plan-build.md for the planned plan-build orchestration
and the open module-layer questions.
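The deterministic keyword fallback can be illustrated with a simple overlap score. This is a hypothetical sketch, not the planned ShapeMatcher API: the `keyword_score` and `match_shapes` functions, the keyword sets, and the "marketplace" shape are invented here, and only the min_score=0.5 default and the idea of citing the driving cluster come from the description above.

```python
import re


def keyword_score(shape_keywords: set[str], claim: str) -> float:
    """Fraction of a shape's keywords that appear in a cluster claim."""
    if not shape_keywords:
        return 0.0
    words = set(re.findall(r"[a-z0-9]+", claim.lower()))
    return len(shape_keywords & words) / len(shape_keywords)


def match_shapes(shapes: dict[str, set[str]], claims: list[str],
                 min_score: float = 0.5) -> list[tuple[str, float, int]]:
    """Rank shapes by their best-scoring cluster; cite that cluster's rank."""
    if not claims:
        return []
    matches = []
    for name, keywords in shapes.items():
        scored = [(keyword_score(keywords, c), rank) for rank, c in enumerate(claims, 1)]
        # Highest score wins; on a tie, prefer the better-ranked cluster.
        score, rank = max(scored, key=lambda pair: (pair[0], -pair[1]))
        if score >= min_score:
            matches.append((name, score, rank))
    # Sort by score descending, then name, so ties are deterministic.
    return sorted(matches, key=lambda m: (-m[1], m[0]))


shapes = {
    "synthesize": {"dashboard", "reports", "data"},
    "marketplace": {"buyers", "sellers"},
}
claims = [
    "Teams want a dashboard that turns raw data into weekly reports",
    "Sellers struggle to find repeat customers",
]
ranked = match_shapes(shapes, claims)
```

Each result carries the rank of the cluster that drove it, so a match can be traced back to evidence rather than taken on trust.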
9. End-to-end: idea → live, paid product
Two pieces are still planned: the startup-shapes catalog (#57) and deploy + billing (#51: metalworks deploy to Vercel and metalworks billing create to Stripe, new DeployProvider / BillingProvider protocols mirroring the
llm/search adapters, irreversible steps gated like Reddit posting).
10. The Clique relationship
Clique-Labs/metalworks is a separate, private Next.js factory that builds and hosts
micro-SaaS products. It consumes this OSS engine as its demand brain (a pip dependency; the
adapter maps the OSS arc → its WedgeSpec). The engine specs; the factory (and any Claude
Code terminal) builds. Clique itself is the reference implementation of the synthesize
shape (external-data ingestion → LLM synthesis → cited evidence → dashboard).
11. Packaging, distribution, testing
- Install: pip install "metalworks[<provider>,research]" + one env key. Extras pull provider SDKs / DuckDB / redditwarp / supabase / mcp / playwright ([browser], a Chromium post-install step) behind [...]; core stays lean (pydantic, httpx, typer, rich).
- Distribution: PyPI, the metalworks CLI, the MCP server, the Claude Code plugin.
- Quality bars: pyright strict on src/, ruff, pytest run offline by default (--disable-socket; network tests gated -m network, real-browser tests -m browser). metalworks.testing ships conformance suites for custom adapters/backends. The contract drift check (gen_ts_types --check) is run by /pr-ready, not by CI.
- Honesty in tests: the demand pipeline’s guarantees (exact-quote match, breadth weighting, deterministic verdict) and the deterministic design review are CI-tested, not aspirational.