Under the CorpusReader protocol, Arctic Shift is the default implementation, not a requirement (see
Use your own corpus). This guide takes the next step:
instead of letting the library reach out to the (slow, rate-limited) Hugging Face
mirror at runtime, you run a sample loader script once to materialize a local
Parquet corpus, then point metalworks at that directory.
The script lives at scripts/load_arctic_corpus.py in the
repo. It is standalone — standard library plus duckdb only — so it doubles as a
copy-paste reference if you want to adapt it for a different mirror or source.
Why build your own corpus
- No runtime dependency on the HF mirror. The mirror is convenient for a quick run but slow and rate-limited. A local corpus is fast and offline.
- Reproducibility. A committed corpus directory pins exactly what a run saw.
- Control. Pull only the subreddits and months you care about, once.
Prerequisites
The script needs duckdb, which it uses to read the mirror’s Parquet shards
over httpfs. (The script will run with a bare pip install duckdb too, but
you’ll want the full extra to point metalworks at the result.)
Pull a corpus
| Flag | Meaning |
|---|---|
| --subreddit, -s NAME | Subreddit to pull. Repeatable: -s Supplements -s Nootropics. |
| --months, -m INT | How many months back to pull, ending at the latest available month (default 1). |
| --out, -o PATH | Output corpus root (default ./corpus). |
| --comments | Also fetch live comment trees for pulled submissions (Arctic Shift API). |
| --limit INT | Max submissions per subreddit-month (default: no limit). |
| --hf-token TOKEN | Hugging Face token for authenticated mirror reads (optional). |
Run python scripts/load_arctic_corpus.py --help for the full list.
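As a rough illustration of how the flags combine, the sketch below assembles an invocation from Python. The helper name build_command and its parameters are hypothetical; only the script path and the flags themselves come from the table above.

```python
import sys
from pathlib import Path


def build_command(subreddits, months=1, out=Path("./corpus"), comments=False, limit=None):
    """Assemble an argv list for the loader script (hypothetical helper)."""
    cmd = [sys.executable, "scripts/load_arctic_corpus.py"]
    for name in subreddits:
        # --subreddit is repeatable, so emit one -s per subreddit.
        cmd += ["-s", name]
    cmd += ["--months", str(months), "--out", str(out)]
    if comments:
        cmd.append("--comments")
    if limit is not None:
        cmd += ["--limit", str(limit)]
    return cmd


cmd = build_command(["Supplements", "Nootropics"], months=3, comments=True)
# Show the arguments that would follow the Python interpreter.
print(" ".join(cmd[1:]))
# To actually run the pull: subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting issues if an output path contains spaces.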
What it produces
The script writes a directory laid out exactly how ArcticReader globs for
shards.
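To check what a pull actually wrote, a small sketch like the following walks the output root and lists every Parquet file it finds. It assumes only that shards end in .parquet; the helper name list_shards is hypothetical and nothing here depends on the exact directory layout.

```python
from pathlib import Path


def list_shards(corpus_root):
    """Return every Parquet file under corpus_root as sorted relative paths."""
    root = Path(corpus_root)
    if not root.is_dir():
        # Nothing has been pulled yet (or the path is wrong).
        return []
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*.parquet"))


if __name__ == "__main__":
    # ./corpus is the loader's default output root.
    for shard in list_shards("./corpus"):
        print(shard)
```

An empty listing after a pull usually means the --out path and the path being inspected differ.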
Point metalworks at it
ArcticReader(data_root=...) reads the local layout with no hf:// access.
With comments=None, the report comes
back partial, with submission-level signal only (see
Comments are optional).
Going further
The loader is deliberately small and source-specific. To run research over a non-Arctic source (your own database, an internal API, a cache), implement the CorpusReader and CommentSource protocols directly; that path is documented in
Use your own corpus.
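The general idea of implementing a protocol directly can be sketched with typing.Protocol. Everything named here is hypothetical: ReaderLike, the iter_submissions method, and the dict-shaped records are stand-ins for illustration, not the library's actual CorpusReader interface, which is defined in Use your own corpus.

```python
from typing import Iterable, Iterator, Protocol, runtime_checkable


@runtime_checkable
class ReaderLike(Protocol):
    """Hypothetical stand-in for a reader protocol; NOT the real CorpusReader."""

    def iter_submissions(self, subreddit: str) -> Iterator[dict]: ...


class InMemoryReader:
    """Serves submissions from a list, e.g. rows loaded from your own database."""

    def __init__(self, rows: Iterable[dict]) -> None:
        self._rows = list(rows)

    def iter_submissions(self, subreddit: str) -> Iterator[dict]:
        # Structural typing: no inheritance from ReaderLike is needed.
        return (r for r in self._rows if r["subreddit"] == subreddit)


# Made-up sample rows, purely for illustration.
reader = InMemoryReader([
    {"id": "a1", "subreddit": "Supplements", "title": "Magnesium timing"},
    {"id": "b2", "subreddit": "Nootropics", "title": "Caffeine and L-theanine"},
])
assert isinstance(reader, ReaderLike)
print([r["id"] for r in reader.iter_submissions("Nootropics")])
```

Because protocols are matched structurally, an existing database or API client can satisfy one by gaining the right methods, without subclassing anything from the library.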