- A new source (a forum, a reviews API, an internal dataset, anything not
  Reddit-shaped) → write an ItemSource connector. It’s the modern, source-neutral path: you map your items onto CorpusRecord / CorpusComment and they ingest into the shared corpus alongside Reddit, Hacker News, and the web. See Sources → bring your own source.
- Your own Reddit data (local parquet, a database, a cache instead of the
  Arctic Shift archive) → implement CorpusReader + CommentSource, the Reddit-archive backend seam covered below.
reddit /
arctic connectors read from your data instead.
The protocols
RedditPost / RedditComment
and writes them to your store.
Wire it in
Comments are optional
If you have submissions but no comment source, pass comments=None (the default
when offline). The pipeline marks the report partial with a caveat rather than
failing. Cluster quotes come from comments, so a comments-less run produces
submission-level signal only.
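The behavior described above can be sketched with a small stand-in. Everything here is hypothetical and for illustration only: ToyCommentSource, its comments_for method, Report, and build_report are not the library's real names or signatures. The sketch only shows the shape of the idea, which is that a missing comment source degrades the report instead of raising.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional, Protocol


@dataclass
class Post:
    id: str
    title: str


@dataclass
class Comment:
    post_id: str
    body: str


class ToyCommentSource(Protocol):
    # Hypothetical method name; the real CommentSource protocol may differ.
    def comments_for(self, post_id: str) -> Iterable[Comment]: ...


@dataclass
class Report:
    posts: list[Post]
    quotes: list[str] = field(default_factory=list)
    partial: bool = False
    caveats: list[str] = field(default_factory=list)


def build_report(posts: list[Post],
                 comments: Optional[ToyCommentSource] = None) -> Report:
    """Build a report; without comments, flag it as partial instead of failing."""
    report = Report(posts=list(posts))
    if comments is None:
        # No comment source: keep submission-level signal and record a caveat.
        report.partial = True
        report.caveats.append("no comment source: submission-level signal only")
        return report
    # Quotes come from comments, so they are only populated on this branch.
    for post in posts:
        report.quotes.extend(c.body for c in comments.comments_for(post.id))
    return report
```

Calling build_report(posts) with no second argument yields a report with partial set and an empty quotes list, which mirrors the "partial with a caveat" outcome described above.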
Fully offline
Point a local reader at committed parquet and use fake models and an in-memory store, and the whole pipeline runs with no network. This is the pattern for tests and air-gapped runs.
Arbitrary, non-Reddit data
If your data isn’t Reddit-shaped, don’t use CorpusReader; write an
ItemSource instead. It maps any items
onto the source-neutral CorpusRecord / CorpusComment spine, so they ingest,
triage, cluster, and rank exactly like every built-in source. That’s the
shipped path for a true non-Reddit corpus.
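As a rough illustration of that mapping, here is a minimal sketch for a forum export. The real ItemSource interface and the actual fields of CorpusRecord / CorpusComment are not shown in this section, so the classes below (ToyCorpusRecord, ToyCorpusComment, ForumItemSource, and its records method) are hypothetical stand-ins, and the input dict keys are an assumed forum schema.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class ToyCorpusComment:
    # Stand-in for CorpusComment; real field names may differ.
    record_id: str
    author: str
    text: str


@dataclass
class ToyCorpusRecord:
    # Stand-in for CorpusRecord; real field names may differ.
    id: str
    source: str
    title: str
    text: str
    comments: list[ToyCorpusComment] = field(default_factory=list)


class ForumItemSource:
    """Hypothetical connector: maps forum threads onto source-neutral records."""

    name = "forum"

    def __init__(self, threads: list[dict]):
        self._threads = threads

    def records(self) -> Iterator[ToyCorpusRecord]:
        for thread in self._threads:
            # Prefix ids with the source name so they cannot collide
            # with ids coming from other sources in the shared corpus.
            record_id = f"{self.name}:{thread['thread_id']}"
            yield ToyCorpusRecord(
                id=record_id,
                source=self.name,
                title=thread["subject"],
                text=thread.get("body", ""),
                comments=[
                    ToyCorpusComment(record_id, reply["user"], reply["message"])
                    for reply in thread.get("replies", [])
                ],
            )


# Usage: one thread with a single reply (made-up sample data).
threads = [
    {
        "thread_id": 7,
        "subject": "Export is slow",
        "body": "CSV export takes minutes.",
        "replies": [{"user": "ana", "message": "Same here."}],
    }
]
records = list(ForumItemSource(threads).records())
```

The point of the design is that the connector only translates shapes; downstream stages see the same record type regardless of where the items came from.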