Screenshot Capture

A multi-stage, fully on-device pipeline that finds every distinct slide, reads its text to remove duplicates, and saves the sharpest frame of each.

How It Works

Screenshot capture runs in two phases using Apple's AVFoundation framework. A low-resolution scan decides which moments are worth capturing; a full-resolution capture then saves the sharpest frame of each. Both phases use the GPU and video decoder, and run in parallel with WhisperKit's transcription on the Neural Engine. Several on-device intelligence layers (Apple Vision text detection, accurate OCR, and perceptual hashing) run during the scan so that the result is the set of slides that matter, each captured once, rather than a flood of near-identical frames of the presenter. Nothing leaves your Mac.

Phase 1 — Scan

The video is decoded at 640×360 and 2 fps. (The earlier pipeline scanned at 160×90; the larger size is needed so that slide text is big enough for Apple Vision to detect and read — a headline that is 60px tall at 1080p would render at ~5px at 160×90 and be invisible to text detection.) For every sampled frame, the scan combines three signals:
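The scaling argument above is simple proportional arithmetic. A minimal sketch (the function name is illustrative, not part of Harvestry):

```python
def scaled_height(feature_px: float, source_height: int, scan_height: int) -> float:
    """Height of a feature, in pixels, after the frame is scaled to scan_height."""
    return feature_px * scan_height / source_height

# A 60 px headline in a 1080p source frame:
print(scaled_height(60, 1080, 90))   # 5.0  (earlier 160x90 scan)
print(scaled_height(60, 1080, 360))  # 20.0 (current 640x360 scan)
```

At 640×360 the same headline is about 20 px tall, which is large enough to carry recognizable letter shapes.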

Change detection — A coverage metric measures what fraction of pixels changed from the previous frame. This is used both to detect a new scene and to confirm a slide has settled (stopped animating) before it is considered for capture.
Text detection — Apple Vision shape detection (VNDetectTextRectanglesRequest) checks whether the frame carries text structure distributed across it, such as a slide, whiteboard, map, or labeled diagram. A logo-print shirt, a news ticker, or a lower-third caption confined to the bottom of the frame does not qualify.
De-duplication — Before a candidate is queued, it is checked against everything already captured this run, so the same slide is never saved twice and the recurring shot of the speaker collapses to a single representative. See Choosing the Best Slides.

If a slide remains on screen longer than the configured max interval, the clock advances and a representative is queued even without a scene change, so dense, slow-moving lectures still get coverage throughout.

The scan uses a single batched generateCGImagesAsynchronously(forTimes:completionHandler:) call on AVAssetImageGenerator: all frame requests are issued at once and delivered in pipeline order, which is far faster than sequential random-access seeks.
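The queueing rules of the scan (change, settle, max interval) can be sketched as a small loop over per-frame coverage values. This is an illustrative sketch, not Harvestry's code: the two thresholds and the function name are assumptions, and text detection and de-duplication are left out.

```python
def pick_candidates(coverages, fps=2.0, max_interval=30.0,
                    change_threshold=0.5, settle_threshold=0.02):
    """Return timestamps (seconds) of frames to queue for capture.

    coverages[i] is the fraction of pixels that changed between frame i-1
    and frame i. Both thresholds are hypothetical values for illustration.
    """
    candidates = []
    pending_change = False   # a scene change was seen and has not settled yet
    last_capture = 0.0
    for i, cov in enumerate(coverages):
        t = i / fps
        if cov >= change_threshold:
            pending_change = True      # new scene started; wait for it to settle
            continue
        settled = pending_change and cov <= settle_threshold
        overdue = t - last_capture >= max_interval
        if settled or overdue:
            candidates.append(t)
            last_capture = t
            pending_change = False
    return candidates

# One slide shown for 65 seconds: captured once it settles, then every 30 s.
print(pick_candidates([1.0] + [0.0] * 130))  # [0.5, 30.5, 60.5]
```

The first frame is given a coverage of 1.0 by convention, so the opening slide is treated as a new scene.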

Phase 2 — Capture

For each candidate timestamp, Harvestry first performs best-frame selection: it probes a short forward window of thumbnails, scores each for sharpness using the Variance of Laplacian, and chooses the fully-settled, sharpest one. This is why slides that fade, wipe, or build in are captured complete — never mid-entrance.
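As a rough illustration of best-frame selection, the sketch below scores each frame in a window with the Variance of Laplacian and returns the index of the sharpest. Frames are plain 2-D lists of grayscale values; the 4-neighbour kernel and the function names are assumptions, and the settle check is omitted.

```python
def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels."""
    h, w = len(gray), len(gray[0])
    values = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            values.append(gray[y - 1][x] + gray[y + 1][x]
                          + gray[y][x - 1] + gray[y][x + 1]
                          - 4 * gray[y][x])
    if not values:           # image too small to have interior pixels
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def sharpest_index(frames):
    """Index of the frame with the highest Variance of Laplacian."""
    scores = [laplacian_variance(f) for f in frames]
    return max(range(len(frames)), key=scores.__getitem__)
```

A slide that is still fading in has weaker edges than the fully rendered slide, so it scores lower and is passed over in favour of a later frame in the window.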

Up to 4 AVAssetImageGenerator instances then run concurrently to decode the chosen frames at full resolution (up to 2560×1440). A frame whose average brightness is below ~5% (a near-black fade transition) is discarded; the rest are saved as JPEG at 95% quality to the lecture's screenshots folder.
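The near-black filter amounts to a mean-brightness test. A minimal sketch, assuming 8-bit grayscale values and reading "~5%" as a fraction of full brightness (the function name is illustrative):

```python
def is_near_black(gray, threshold=0.05):
    """True if the mean brightness is below `threshold` of full scale (255)."""
    total = sum(sum(row) for row in gray)
    count = len(gray) * len(gray[0])
    return total / (count * 255) < threshold
```

With this reading, a frame is discarded when its average pixel value is below roughly 13 out of 255.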

Choosing the Best Slides

Finding candidate frames is the easy part. The harder problem — the one that separates a useful set of screenshots from a dump of every frame — is deciding which candidates are genuinely new. A single similarity test can't do this, because slides and talking-head shots are "the same" along opposite axes: slides differ by their text but can look pixel-identical (two dark title cards), while talking-head shots differ in pixels (a gesture) but carry the same content. So Harvestry picks the right test per frame:

Slides — compared by their words. Harvestry runs accurate on-device OCR (VNRecognizeTextRequest at the accurate recognition level) on each text-bearing candidate and compares the recognized words against previous captures using Jaccard set similarity. A new headline means a new slide, and it is kept; substantially overlapping text means a repeat, and it is skipped. This is the step a pixel comparison can't do: two dark slides that are 95% identical at the pixel level are told apart instantly by what they say.
Talking-head & b-roll — compared by a perceptual fingerprint. Frames with no meaningful text are hashed with a 64-bit difference hash (dHash), which captures a frame's overall composition and ignores fine detail. Shots that differ only by a gesture land within a small Hamming distance and collapse to a single representative; a cut to genuinely different footage does not. This is what keeps a 20-minute lecture from producing dozens of near-identical frames of the presenter.
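Both tests are standard techniques and can be sketched in a few lines. The thresholds (0.6 and 10), the tokenizer, and the function names below are assumptions for illustration; the OCR step and the resize to a 9×8 grayscale thumbnail are assumed to have happened already.

```python
import re

def word_set(text):
    """Lower-cased set of alphanumeric words in OCR output."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_repeat_slide(text, previous_texts, threshold=0.6):
    """True if the text substantially overlaps any earlier capture."""
    words = word_set(text)
    return any(jaccard(words, word_set(p)) >= threshold for p in previous_texts)

def dhash(thumb):
    """64-bit difference hash of a 9-wide by 8-tall grayscale thumbnail.

    Each bit records whether a pixel is brighter than its right neighbour.
    """
    bits = 0
    for row in thumb:
        for x in range(8):
            bits = (bits << 1) | (1 if row[x] > row[x + 1] else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def is_repeat_shot(h, previous_hashes, max_distance=10):
    """True if the hash is within max_distance bits of an earlier capture."""
    return any(hamming(h, p) <= max_distance for p in previous_hashes)
```

Because dHash encodes only the direction of brightness gradients between neighbouring pixels of a tiny thumbnail, a small gesture flips few bits, while a cut to different footage flips many.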

ℹ️ All on-device. Every stage (shape detection, OCR, hashing) runs locally through Apple's Vision framework. No images, no recognized text, and no fingerprints ever leave your Mac.

Scene Change Detection

The low-level change metric is tuned for lecture content. Rather than mean pixel difference (which a slow camera drift or lighting shift can inflate), Harvestry measures coverage — the fraction of pixels that changed beyond a noise floor:

Pixel difference threshold: 20/255 — Small colour variations from video compression noise fall below this and are ignored.
Changed-pixel fraction — A presenter's webcam shifts only 5–10% of pixels when they move; a new slide rewrites 60–90%. Capture is only considered once a meaningful fraction changes and the frame then settles, so face movement never triggers a capture on its own.
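The coverage metric described above can be sketched as follows. Frames are 2-D lists of 8-bit grayscale values of equal size; the function name is illustrative, and treating a difference of exactly 20 as noise is an assumption.

```python
def coverage(prev, curr, pixel_threshold=20):
    """Fraction of pixels whose absolute difference exceeds the noise floor."""
    changed = 0
    total = 0
    for row_p, row_c in zip(prev, curr):
        for p, c in zip(row_p, row_c):
            total += 1
            if abs(p - c) > pixel_threshold:
                changed += 1
    return changed / total
```

A uniform lighting shift of a few levels leaves every pixel under the noise floor and yields a coverage of 0, whereas a mean-difference metric would register it across the whole frame.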

Sharpness & Blur Detection

Sharpness is measured with the Variance of Laplacian (VoL), a standard image-focus metric, and is used in two places: best-frame selection during capture (picking the sharpest frame in the forward window) and the Seek to Clear player control.

For the stricter grid analysis, Harvestry evaluates a 3×2 grid of cells on a 320×180 thumbnail (each cell ≈ 107×90 px). Before a cell is scored, it passes two gates that exclude uninformative regions:

Darkness gate — Cells with average brightness below 15/255 are skipped, so letterbox bars and dark regions aren't scored as "blurry".
Featurelessness gate — Cells with very low texture variance (blank slide backgrounds, solid fills) are skipped. An empty white slide is featureless, not blurry.

Any cell that passes both gates and scores below the VoL threshold marks the frame as blurry. The grid approach prevents a sharp presenter webcam in one corner from rescuing a frame whose slide content is actually blurry — something whole-frame scoring cannot catch.
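The grid check can be sketched as below. The darkness floor of 15 comes from the description above; the texture floor, the VoL threshold, the 4-neighbour Laplacian kernel, and all function names are assumptions chosen for illustration.

```python
def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels."""
    h, w = len(gray), len(gray[0])
    values = [gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
              - 4 * gray[y][x]
              for y in range(1, h - 1) for x in range(1, w - 1)]
    return variance(values) if values else 0.0

def is_blurry(gray, cols=3, rows=2, dark_floor=15,
              texture_floor=25.0, vol_threshold=100.0):
    """True if any informative grid cell scores below the VoL threshold."""
    h, w = len(gray), len(gray[0])
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * h // rows, (r + 1) * h // rows
            x0, x1 = c * w // cols, (c + 1) * w // cols
            cell = [row[x0:x1] for row in gray[y0:y1]]
            pixels = [p for row in cell for p in row]
            if sum(pixels) / len(pixels) < dark_floor:
                continue     # darkness gate: letterbox bars, dark regions
            if variance(pixels) < texture_floor:
                continue     # featurelessness gate: blank backgrounds
            if laplacian_variance(cell) < vol_threshold:
                return True  # one soft cell is enough to reject the frame
    return False
```

Because the function returns as soon as one informative cell fails, a sharp region elsewhere in the frame cannot offset a blurry one.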

ℹ️ Strict vs. relaxed mode. The grid (strict) mode is used where a region-by-region judgement matters. The Seek to Clear control uses a relaxed, single-pass whole-frame VoL so it stops at the first frame that looks acceptably sharp to a human viewer, rather than enforcing the tighter grid requirement.

Max Interval Setting

The Max Interval setting in Settings → Screenshots controls the maximum time that can elapse between consecutive screenshots, regardless of scene activity.

Range: 15–120 seconds
Default: 30 seconds

At 30 seconds, a one-hour lecture will always have at least 120 screenshots, even if the slides change very infrequently. Reducing this below 30 seconds increases screenshot count; raising it above 30 seconds reduces it.
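The lower bound follows from dividing the lecture length by the interval. A small sketch (the function name is illustrative, and frames discarded later in the pipeline are not accounted for):

```python
def min_screenshots(duration_s: int, max_interval_s: int) -> int:
    """Lower bound on queued screenshots implied by the max interval alone."""
    return duration_s // max_interval_s

# A one-hour lecture at the minimum, default, and maximum settings:
for interval in (15, 30, 120):
    print(interval, min_screenshots(3600, interval))
# 15 240
# 30 120
# 120 30
```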

For very fast-paced content (code walkthroughs with frequent edits), reduce the interval. For slow slide decks with dense text, the default 30 seconds is usually appropriate.

Manual Frame Capture

After processing, you can add screenshots manually from the in-app player.

1. Navigate to the moment you want. Use the scrubber below the video player, or click on a transcript segment to jump to that timestamp.

2. Click "Add to Transcript". The Add to Transcript button is in the scrubber bar below the player. Click it to capture the current frame at full resolution.

3. The screenshot is inserted in order. The new screenshot is saved to disk and inserted into the transcript panel in timestamp order. A "Modified, re-export needed" badge appears; click Reexport to regenerate the HTML with the new screenshot included.

Seek to Clear

The ◀ Seek to Clear and Seek to Clear ▶ buttons below the video player step through the video in 0.5-second increments — up to 20 seconds in either direction — and stop at the first frame that passes the blur check. This is useful when the video contains a brief sharp moment surrounded by motion blur (e.g., a camera panning to a new slide).

Seek to Clear uses the relaxed blur mode (single whole-frame VoL) rather than the strict grid mode. This means it may stop on frames that the automated pipeline would have rejected — but those frames are typically acceptable to human viewers.

While seeking, the button label shows a spinner. Seeking checks each frame live using AVAssetImageGenerator at 320×180 resolution for speed.
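The stepping behaviour can be sketched as a bounded linear search. The function name and the `is_clear` callback (which stands in for decoding a thumbnail and running the relaxed blur check) are hypothetical.

```python
def seek_to_clear(start, direction, is_clear, duration,
                  step=0.5, max_distance=20.0):
    """Step from `start` in `direction` (+1 or -1) and return the first
    timestamp that passes the blur check, or None if none is found
    within max_distance seconds or before the video ends."""
    steps = int(max_distance / step)
    for n in range(1, steps + 1):
        t = start + direction * n * step
        if t < 0 or t > duration:
            break
        if is_clear(t):
            return t
    return None

# Pretend only the frames between 12.0 s and 12.5 s are sharp.
clear = lambda t: 12.0 <= t <= 12.5
print(seek_to_clear(10.0, +1, clear, duration=60.0))  # 12.0
print(seek_to_clear(14.0, -1, clear, duration=60.0))  # 12.5
```

With 0.5-second steps and a 20-second limit, at most 40 frames are checked per click.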

AV1 and VP9 Videos

If yt-dlp downloads a video encoded in VP9 or AV1 — which happens when a site has no H.264 stream available — Harvestry cannot use AVAssetImageGenerator for random-access seeks. Instead, it automatically transcodes the video to H.264 using h264_videotoolbox (hardware-accelerated) before running the screenshot pipeline.

You will see a "Converting video format…" spinner on the lecture detail view while this happens. To save disk space, the H.264 transcode replaces the original file: the AV1/VP9 original is deleted once the transcode succeeds.

For the in-app player, the same transcode is reused. Harvestry caches the transcode path so it doesn't need to redo the conversion on subsequent opens.
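The decision and the transcode command might look roughly like the sketch below. It assumes the codec name is reported in ffprobe style ("vp9", "av1", "h264") and that ffmpeg is invoked as a subprocess; the function names, the AAC audio setting, and the absence of bitrate options are illustrative assumptions, not Harvestry's actual invocation.

```python
def needs_transcode(codec_name: str) -> bool:
    """True for codecs the screenshot pipeline cannot seek in directly."""
    return codec_name.lower() in {"vp9", "av1"}

def transcode_command(src: str, dst: str) -> list:
    """Argument list for a hardware-accelerated H.264 transcode with ffmpeg."""
    return [
        "ffmpeg", "-y",              # overwrite the output if it exists
        "-i", src,                   # input file (VP9 or AV1)
        "-c:v", "h264_videotoolbox", # VideoToolbox hardware H.264 encoder
        "-c:a", "aac",               # assumption: re-encode audio to AAC
        dst,
    ]

if needs_transcode("vp9"):
    print(" ".join(transcode_command("lecture.webm", "lecture.mp4")))
```

The list form can be passed to `subprocess.run` without shell quoting concerns.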

💡 Prefer H.264 at download time. Harvestry's yt-dlp format string already prefers [vcodec^=avc] to request H.264 streams. Transcoding only happens when the platform has no H.264 option at all.
