Bulk crawling

Seed post URLs from listings, queue them, crawl into a local SQLite store, and export the records.

To collect more than one page at a time, reddit provides a small pipeline: seed post URLs from subreddit listings, enqueue them, crawl the queue into a local SQLite store, and export what you collected. Everything lands in one database file under your data dir.

1. Seed from listings

seed walks one or more subreddit listings and emits the comment-page URL of every post it finds:

reddit seed golang --sort top --time week

It takes the same --sort and --time flags as the posts command, and the same -n and --pages flags to control how deep it walks. Without --enqueue it only prints the URLs, so you can pipe them somewhere else:

reddit seed golang rust --sort top -o jsonl | jq -r .url
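If you merge the output of several seed runs (for example the same subreddit under two sort orders), the same post URL can show up more than once. A minimal sketch of a filter for that case, assuming one URL per line on stdin as produced by the jq step above; unique_urls is a hypothetical helper, not part of reddit:

```shell
# Hypothetical helper: read one URL per line on stdin, drop anything that
# does not look like an http(s) URL, and print each URL once (sorted).
unique_urls() {
  grep -E '^https?://' | sort -u
}
```

Usage would look like `reddit seed golang rust --sort top -o jsonl | jq -r .url | unique_urls > urls.txt`. Note that sort -u reorders the URLs, so listing order is lost.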

2. Enqueue

Add --enqueue to put the discovered URLs into the crawl queue instead of just printing them:

reddit seed golang --sort top --time week --enqueue

3. Crawl the queue

crawl drains the queue: it fetches each post, caches the page, and with --parse also stores the post and its comments in the records table. --max caps how many queued URLs to process (0 drains the whole queue). It honors the global --workers and --delay settings, so the crawl stays polite:

reddit crawl --max 50 --parse

It reports how many URLs it processed and how many failed. Exit code 3 means nothing was processed (an empty queue); exit code 4 means some failed (see troubleshooting).
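The exit codes make it possible to drain a large queue in bounded batches from a script. The following is a sketch, not a built-in feature: drain_queue is a hypothetical helper, and it assumes that exit code 0 means a batch finished without failures. Because it is not stated here whether failed URLs stay in the queue, the loop is capped at a maximum number of rounds.

```shell
# Hypothetical helper: run `reddit crawl` in batches until the queue is empty.
# $1 = batch size (default 50), $2 = maximum number of rounds (default 100).
drain_queue() {
  batch="${1:-50}"
  max_rounds="${2:-100}"
  done_batches=0
  failed_batches=0
  round=0
  while [ "$round" -lt "$max_rounds" ]; do
    round=$((round + 1))
    status=0
    reddit crawl --max "$batch" --parse || status=$?
    case "$status" in
      0) done_batches=$((done_batches + 1)) ;;      # assumed: batch fully succeeded
      3) break ;;                                   # nothing processed: queue is empty
      4) done_batches=$((done_batches + 1))         # some URLs in this batch failed
         failed_batches=$((failed_batches + 1)) ;;
      *) echo "crawl exited with unexpected status $status" >&2
         return "$status" ;;
    esac
  done
  echo "batches=$done_batches with_failures=$failed_batches"
}
```

Calling `drain_queue 50` processes up to 50 URLs per round and prints a one-line summary when the queue reports empty or the round cap is reached.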

4. Inspect and export the store

db works with the local SQLite store:

reddit db info                          # summarize records and the queue
reddit db count post                    # how many posts are stored
reddit db get post 1abc23               # one stored record as JSON
reddit db export --type post --out posts.jsonl
reddit db vacuum                        # reclaim space

db export writes every stored record (or just one --type) to a file with --out, or to stdout.
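A quick sanity check after an export is to compare the stored count with the number of exported lines. This sketch assumes that `reddit db count post` prints a bare integer and that the export holds one JSON record per line; export_posts_checked is a hypothetical helper:

```shell
# Hypothetical helper: export all posts to $1, then compare the number of
# lines written with the count the store reports.
export_posts_checked() {
  out="$1"
  reddit db export --type post --out "$out" || return 1
  expected=$(reddit db count post) || return 1     # assumed to be a bare integer
  actual=$(wc -l < "$out" | tr -d ' ')             # one record per line assumed
  if [ "$expected" = "$actual" ]; then
    echo "ok: $actual posts exported"
  else
    echo "mismatch: store has $expected posts, export has $actual lines" >&2
    return 1
  fi
}
```

If either assumption does not hold for your version, the check reports a mismatch even when the export itself is fine.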

The page cache

Every fetch goes through an on-disk cache (content-addressed, gzip-compressed), so a re-crawl does not re-fetch pages that have not changed. The cache command manages it:

reddit cache info                                              # location, file count, size
reddit cache path https://www.reddit.com/comments/1abc23.json   # the cache file for a URL
reddit cache clear                                             # remove every cached page

The cache TTL defaults to 24 hours. Bypass it for one run with --no-cache, or force a re-fetch with --refresh.
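To look at what was actually cached for a URL, you can combine cache path with gzip. This is a sketch under two assumptions: that cache path prints only the file path, and that the cache file is an ordinary gzip stream that `gzip -dc` can read. show_cached is a hypothetical helper:

```shell
# Hypothetical helper: print the decompressed cached body for a URL.
show_cached() {
  file=$(reddit cache path "$1") || return 1       # assumed to print just the path
  if [ ! -f "$file" ]; then
    echo "not cached: $1" >&2
    return 1
  fi
  gzip -dc "$file"                                 # assumes a plain gzip stream
}
```

For example, `show_cached https://www.reddit.com/comments/1abc23.json | head -c 300` would show the start of the cached response.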

The whole pipeline

Put together, collecting a slice of a community looks like this:

reddit seed golang --sort top --time month -n 200 --enqueue
reddit crawl --parse
reddit db export --type post --out golang.jsonl

The store and cache both live under the data dir; point that elsewhere with --data-dir or REDDIT_DATA_DIR, and point the database file alone with --store. See configuration.
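Since db export --type post writes every stored post, one way to keep communities apart is to give each subreddit its own data dir. The sketch below wraps the three pipeline steps with REDDIT_DATA_DIR set per subreddit; collect_subreddit and the directory layout are hypothetical, and only the subcommands and flags come from this page:

```shell
# Hypothetical helper: collect one subreddit into its own data dir.
# $1 = subreddit name, $2 = base directory (default ./data).
collect_subreddit() {
  sub="$1"
  base="${2:-./data}"
  (
    # The subshell keeps REDDIT_DATA_DIR from leaking into the caller.
    export REDDIT_DATA_DIR="$base/$sub"
    mkdir -p "$REDDIT_DATA_DIR" &&
    reddit seed "$sub" --sort top --time month -n 200 --enqueue &&
    # Exit code 4 (some URLs failed) still leaves records worth exporting.
    { reddit crawl --parse || [ "$?" -eq 4 ]; } &&
    reddit db export --type post --out "$base/$sub.jsonl"
  )
}
```

Running `collect_subreddit golang` and then `collect_subreddit rust` would leave two separate stores under ./data and one JSONL file per community.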