The hackernews source fetches live from the
public HN search API — keyless and always current, but one request at a time. The
hackernews_archive source reads the whole of HN (stories and comments,
2006→present) from a large public Parquet archive, so you can search across years at
once and run fully offline.
Is this for you? Use it if you want to search a lot of HN history, run offline, or
avoid hammering the live API. For a quick, current lookup, the live hackernews source
is simpler. The tradeoff: the archive is big, and a local copy is a snapshot — re-download
to pick up newer posts.
Install
Install duckdb, which the reader uses. (The download script itself needs nothing
beyond the standard library.)
Download a slice
Hacker News isn’t split by topic, so one month is a single file covering the whole site; recent months are hundreds of MB. Start with one month:

| Flag | Meaning |
|---|---|
| --months, -m INT | How many months back to download, ending at the latest available (default 1). |
| --out, -o PATH | Where to write the corpus (default ./hn-corpus). |
| --hf-token TOKEN | Access token for the archive, if you have one (optional). |
| --force | Re-download months you already have. |
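
To make the --months and --force semantics concrete, here is a minimal sketch of how a "months back" count could expand into year-month labels, and how months already on disk might be skipped. The function names and the `<YYYY-MM>.parquet` file naming are assumptions made for illustration; the actual download script may name files and decide what to skip differently.

```python
from datetime import date
from pathlib import Path


def months_back(latest: date, count: int) -> list[str]:
    """Return `count` year-month labels ending at `latest`, oldest first."""
    if count < 1:
        raise ValueError("count must be at least 1")
    labels = []
    year, month = latest.year, latest.month
    for _ in range(count):
        labels.append(f"{year:04d}-{month:02d}")
        month -= 1
        if month == 0:  # step back across a year boundary
            year, month = year - 1, 12
    return labels[::-1]


def months_to_fetch(labels: list[str], out_dir: Path, force: bool = False) -> list[str]:
    """Skip months already on disk unless `force` is set.

    The `<YYYY-MM>.parquet` naming is hypothetical, not taken from the script.
    """
    if force:
        return list(labels)
    return [m for m in labels if not (out_dir / f"{m}.parquet").exists()]


if __name__ == "__main__":
    wanted = months_back(date(2024, 2, 15), 3)
    print(wanted)  # ['2023-12', '2024-01', '2024-02']
    print(months_to_fetch(wanted, Path("./hn-corpus")))
```

With `force=True` every requested month is returned, which matches the table's description of --force as re-downloading months you already have.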
What you get
One Parquet file per month, holding every story and comment for that month:
Point metalworks at it
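
Before pointing anything at the corpus, it can be worth checking what was actually downloaded. The sketch below uses DuckDB directly (not metalworks) to list the columns and count the rows across all months. It assumes the files sit under ./hn-corpus with a .parquet extension; the helper names are hypothetical, and no column names are assumed because the schema is read from the files themselves.

```python
from pathlib import Path


def corpus_glob(corpus_dir: str = "./hn-corpus") -> str:
    """Glob matching every Parquet file under the corpus directory."""
    return (Path(corpus_dir) / "**" / "*.parquet").as_posix()


def describe_corpus(corpus_dir: str = "./hn-corpus"):
    """Return ([(column, type), ...], row_count) for all downloaded months."""
    import duckdb  # imported here so corpus_glob works without duckdb installed

    pattern = corpus_glob(corpus_dir).replace("'", "''")  # escape quotes for SQL
    con = duckdb.connect()  # in-memory database; the Parquet files are read in place
    columns = con.execute(
        f"DESCRIBE SELECT * FROM read_parquet('{pattern}')"
    ).fetchall()
    (row_count,) = con.execute(
        f"SELECT count(*) FROM read_parquet('{pattern}')"
    ).fetchone()
    # DESCRIBE rows start with (column_name, column_type, ...)
    return [(name, dtype) for name, dtype, *_ in columns], row_count


if __name__ == "__main__":
    cols, n = describe_corpus()
    for name, dtype in cols:
        print(f"{name}: {dtype}")
    print(f"{n} rows")
```

Because DuckDB reads the Parquet files where they are, this works offline once the months are on disk.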
Read from a Supabase mirror
For a shared, always-available copy (instead of a local download on each machine), you can mirror the months you want into a private Supabase Storage bucket and read them over signed URLs, with no HF and no local files at query time. Point the source at a HackerNewsArchiveMirrorReader (needs the supabase extra):
Or set HN_ARCHIVE_SOURCE=mirror in the environment, and --source hackernews_archive
resolves to the mirror automatically. This mirrors how Reddit’s Supabase tier works.
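
As a rough illustration of environment-driven resolution, the sketch below picks a backend label from HN_ARCHIVE_SOURCE. The function name and the "local" fallback label are assumptions made for this example; only the variable name and the value mirror come from the text above, and the real resolution logic may differ.

```python
import os


def resolve_archive_backend(environ=None) -> str:
    """Return 'mirror' when HN_ARCHIVE_SOURCE=mirror is set, else 'local'.

    'local' is a hypothetical label for the on-disk corpus.
    """
    env = os.environ if environ is None else environ
    value = env.get("HN_ARCHIVE_SOURCE", "").strip().lower()
    return "mirror" if value == "mirror" else "local"


if __name__ == "__main__":
    print(resolve_archive_backend({"HN_ARCHIVE_SOURCE": "mirror"}))  # mirror
    print(resolve_archive_backend({}))  # local
```

Passing the environment as a plain mapping keeps the function easy to test without changing the real process environment.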