CorpusReader and comments through a
CommentSource. Implement those two small protocols and you can feed research
from anything: local parquet, your own database, an internal API, or a cache.
The protocols
RedditPost / RedditComment
and writes them to your store.
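As a rough illustration of how small the two protocols can be, here is a minimal, self-contained sketch. It does not reproduce the library's actual definitions: the method names `read_posts` and `read_comments`, their signatures, and the `Post` / `Comment` fields are all assumptions made for this example, standing in for the real `CorpusReader`, `CommentSource`, `RedditPost`, and `RedditComment`.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Protocol, runtime_checkable


# Hypothetical stand-ins for RedditPost / RedditComment.
# The real classes and their fields may differ.
@dataclass(frozen=True)
class Post:
    id: str
    subreddit: str
    title: str
    permalink: str


@dataclass(frozen=True)
class Comment:
    id: str
    post_id: str
    body: str


# Sketch of the two protocols. Method names and signatures are assumed.
@runtime_checkable
class CorpusReader(Protocol):
    def read_posts(self, subreddit: str) -> Iterable[Post]: ...


@runtime_checkable
class CommentSource(Protocol):
    def read_comments(self, post_id: str) -> Iterable[Comment]: ...


class ListCorpusReader:
    """Serves posts from an in-memory list (could be parquet, a DB, an API)."""

    def __init__(self, posts: Iterable[Post]) -> None:
        self._posts: List[Post] = list(posts)

    def read_posts(self, subreddit: str) -> List[Post]:
        return [p for p in self._posts if p.subreddit == subreddit]


class DictCommentSource:
    """Serves comments grouped by the post they belong to."""

    def __init__(self, comments: Iterable[Comment]) -> None:
        self._by_post: Dict[str, List[Comment]] = {}
        for c in comments:
            self._by_post.setdefault(c.post_id, []).append(c)

    def read_comments(self, post_id: str) -> List[Comment]:
        # Unknown posts simply have no comments.
        return list(self._by_post.get(post_id, []))
```

Because the protocols are structural, neither class needs to inherit from anything; any object with the right methods qualifies, which is what makes the data source swappable.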
Wire it in
Comments are optional
If you have submissions but no comment source, pass comments=None (the default
when offline). The pipeline marks the report partial with a caveat rather than
failing. Cluster quotes come from comments, so a comments-less run produces
submission-level signal only.
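The degrade-instead-of-fail behavior described above can be pictured with a small sketch. This is not the pipeline's real code: `Report`, `build_report`, and the shape of the `comments` callable are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Report:
    # Hypothetical report shape, for illustration only.
    titles: List[str]
    quotes: List[str] = field(default_factory=list)
    partial: bool = False
    caveats: List[str] = field(default_factory=list)


def build_report(
    posts: Iterable[Tuple[str, str]],  # (post_id, title) pairs
    comments: Optional[Callable[[str], Iterable[str]]] = None,
) -> Report:
    posts = list(posts)
    report = Report(titles=[title for _, title in posts])

    if comments is None:
        # No comment source: flag the report rather than raising.
        report.partial = True
        report.caveats.append(
            "No comment source: submission-level signal only."
        )
        return report

    # Quotes come from comments, so they exist only when a source is given.
    for post_id, _ in posts:
        report.quotes.extend(comments(post_id))
    return report
```

Calling `build_report(posts)` with no comment source returns a report with `partial=True`, one caveat, and no quotes; passing a callable fills in `quotes` and leaves `partial` false.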
Fully offline
Point a local reader at committed parquet and use fake models and an in-memory store, and the whole pipeline runs with no network, which is exactly what Metalworks.demo()
does internally. This is the pattern for tests and air-gapped runs.
Heads up: a true non-Reddit corpus (arbitrary documents, no subreddits or permalinks) is a planned future seam — today the data shape is still Reddit-flavored even though the data source is fully swappable.