metalworks reads Hacker News two ways. The hackernews source fetches live from the public HN search API: keyless and always current, but one request at a time. The hackernews_archive source reads the whole of HN (stories and comments, 2006 to present) from a large public Parquet archive, so you can search across years at once and run fully offline.

Is this for you? Use it if you want to search a lot of HN history, run offline, or avoid hammering the live API. For a quick, current lookup, the live hackernews source is simpler. The tradeoff: the archive is big, and a local copy is a snapshot, so re-download to pick up newer posts.

Install

pip install "metalworks[arctic]"
This includes duckdb, which the reader uses. (The download script itself needs nothing beyond the standard library.)

Download a slice

Hacker News isn’t split by topic, so one month is a single file covering the whole site — recent months are hundreds of MB. Start with one month:
python scripts/load_hn_corpus.py --months 1 --out ./hn-corpus
| Flag | Meaning |
| --- | --- |
| --months, -m INT | How many months back to download, ending at the latest available (default 1). |
| --out, -o PATH | Where to write the corpus (default ./hn-corpus). |
| --hf-token TOKEN | Access token for the archive, if you have one (optional). |
| --force | Re-download months you already have. |
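To get a feel for what --months covers, the sketch below counts back from a given "latest available" month and lists the month labels that would be included. It is an illustration only: the helper name months_back is hypothetical, the script's real logic is not shown on this page, and the latest month is passed in by hand rather than looked up.

```python
from datetime import date


def months_back(latest: date, count: int) -> list[str]:
    """Return `count` month labels (YYYY-MM), newest first, ending at `latest`.

    Hypothetical helper: it only illustrates what "N months back, ending at
    the latest available" means, not how load_hn_corpus.py is implemented.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    labels = []
    year, month = latest.year, latest.month
    for _ in range(count):
        labels.append(f"{year:04d}-{month:02d}")
        month -= 1
        if month == 0:  # stepped past January: wrap to December of the prior year
            year, month = year - 1, 12
    return labels


# Assuming the latest available month is June 2026:
print(months_back(date(2026, 6, 1), 3))  # ['2026-06', '2026-05', '2026-04']
```

Since one month is a single large file, a listing like this is a cheap way to estimate download size before raising --months.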
You can also read the archive directly with no download by leaving the data root at its default — but a month is large to stream over the network on every run, so a local copy is the fast path.

What you get

One Parquet file per month, holding every story and comment for that month:
hn-corpus/
  2026/
    2026-06.parquet
    2026-05.parquet
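
Because a local copy is a snapshot, it can help to check which months you actually have before searching. The following is a minimal sketch using only the standard library; it assumes the layout shown above (a year folder containing YYYY-MM.parquet files), and local_months is a hypothetical helper, not part of metalworks.

```python
import re
from pathlib import Path

# Matches file names such as "2026-06.parquet" (layout assumed from the tree above).
MONTH_FILE = re.compile(r"^(\d{4})-(\d{2})\.parquet$")


def local_months(root: str) -> list[str]:
    """List the months present under `root`, newest first (hypothetical helper)."""
    base = Path(root)
    if not base.is_dir():
        return []
    found = []
    for path in base.glob("*/*.parquet"):
        match = MONTH_FILE.match(path.name)
        # Keep only files that sit in the folder named after their own year.
        if match and path.parent.name == match.group(1):
            found.append(path.stem)
    return sorted(found, reverse=True)


print(local_months("./hn-corpus"))
```

If the newest month in the list is older than you expect, re-run the download script to refresh the snapshot.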

Point metalworks at it

from metalworks import Metalworks
from metalworks.research.sources.hn_archive import (
    HackerNewsArchiveReader,
    HackerNewsArchiveSource,
)

reader = HackerNewsArchiveReader(data_root="./hn-corpus")
mw = Metalworks(sources=[HackerNewsArchiveSource(reader=reader)])
mw.research("a budget mechanical keyboard for programmers")
Stories are matched to your question by keyword; each story’s full comment thread is read straight from the same files, so your quotes come from real HN comments with nothing fetched live.
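
To make "matched by keyword" concrete, here is a toy illustration of keyword overlap between a question and story titles. It is not metalworks' matcher, whose details are not described on this page; the stopword list, the scoring rule, and the example titles are all invented for the sketch.

```python
import re

# A tiny, made-up stopword list; a real matcher would likely use a fuller one.
STOPWORDS = {"a", "an", "the", "for", "of", "to", "and", "in", "on"}


def keywords(text: str) -> set[str]:
    """Lowercase the text and keep alphanumeric words that are not stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def rank_titles(question: str, titles: list[str]) -> list[tuple[int, str]]:
    """Score each title by the number of keywords shared with the question.

    Titles with no shared keyword are dropped; ties keep their input order.
    """
    wanted = keywords(question)
    scored = [(len(wanted & keywords(title)), title) for title in titles]
    return sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])


# Invented example titles, for illustration only.
titles = [
    "Show HN: A keyboard firmware in Rust",
    "Ask HN: Best budget mechanical keyboard?",
    "Postgres 17 released",
]
for score, title in rank_titles("a budget mechanical keyboard for programmers", titles):
    print(score, title)
```

In this toy version the "Ask HN" title scores 3 and the "Show HN" title scores 1, while the unrelated title is dropped.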

Read from a Supabase mirror

For a shared, always-available copy (instead of a local download on each machine), you can mirror the months you want into a private Supabase Storage bucket and read them over signed URLs, with no Hugging Face (HF) access and no local files at query time. Point the source at a HackerNewsArchiveMirrorReader (needs the supabase extra):
from metalworks.research.sources.hn_archive import (
    HackerNewsArchiveMirrorReader,
    HackerNewsArchiveSource,
)

# reads months tracked in a `hackernews_pulls` table + shards under <YYYY>/<MM>/
reader = HackerNewsArchiveMirrorReader()   # SUPABASE_URL + SUPABASE_SERVICE_ROLE_KEY from env
mw = Metalworks(sources=[HackerNewsArchiveSource(reader=reader)])
Or set HN_ARCHIVE_SOURCE=mirror in the environment and --source hackernews_archive resolves to the mirror automatically. This mirrors how Reddit’s Supabase tier works.
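
Since the mirror reader takes its credentials from the environment, a small pre-flight check can catch a missing variable before a query runs. This is a sketch using only the standard library: the variable names come from the text above, while the helper names and the rule that only the exact value mirror selects the mirror are assumptions made for illustration.

```python
import os

# Variable names as given above; the mirror reader reads both from the environment.
REQUIRED_MIRROR_VARS = ("SUPABASE_URL", "SUPABASE_SERVICE_ROLE_KEY")


def wants_mirror(env=None) -> bool:
    """True when HN_ARCHIVE_SOURCE is set to "mirror" (exact match is an assumption)."""
    env = os.environ if env is None else env
    return env.get("HN_ARCHIVE_SOURCE", "") == "mirror"


def missing_mirror_env(env=None) -> list[str]:
    """Return the required Supabase variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_MIRROR_VARS if not env.get(name)]


def check_mirror_config(env=None) -> None:
    """Raise a readable error if the mirror is selected but not configured."""
    if wants_mirror(env) and missing_mirror_env(env):
        missing = ", ".join(missing_mirror_env(env))
        raise RuntimeError(f"HN_ARCHIVE_SOURCE=mirror but missing: {missing}")
```

Calling check_mirror_config() at startup, before building the Metalworks instance, turns a late connection failure into an early, explicit message.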

Other sources

This is one of several sources you can read from. For Reddit’s archive, see Use Reddit’s archive offline; to plug in something else, add your own source.