Screenshot Capture

A two-pass pipeline that detects scene changes, captures full-resolution frames, and automatically discards blurry or dark images.

How It Works

Screenshot capture runs in two distinct phases using Apple's AVFoundation framework. Both phases use the GPU and video decoder, so they run in parallel with WhisperKit's transcription, which uses the Neural Engine.

Phase 1 — Scan

The video is decoded at a small resolution (160×90) and a low frame rate (2 fps). Each frame is compared pixel-by-pixel with the previous frame. A scene change is registered when more than 20% of pixels differ from the previous frame by more than 20 out of 255 brightness levels.
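The comparison rule above can be sketched in a few lines of Python. This is a minimal illustration, not Harvestry's actual implementation: it assumes each 160×90 frame has already been reduced to a flat list of 0-255 brightness values, and the function and parameter names are hypothetical.

```python
def is_scene_change(prev, curr, pixel_threshold=20, changed_fraction=0.20):
    """Return True when enough pixels differ between two grayscale frames.

    prev and curr are equal-length sequences of 0-255 brightness values
    (an assumed representation; the names here are illustrative only).
    """
    if len(prev) != len(curr) or not curr:
        raise ValueError("frames must be non-empty and the same size")
    # Count pixels whose brightness moved by more than the per-pixel threshold.
    changed = sum(1 for a, b in zip(prev, curr) if abs(a - b) > pixel_threshold)
    # A scene change needs more than the given fraction of pixels to change.
    return changed / len(curr) > changed_fraction
```

With these defaults, a frame in which every pixel shifts by 10 levels is ignored, while a frame in which a quarter of the pixels jump by 100 levels counts as a scene change.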

If a slide remains static for longer than the configured max interval, the interval clock advances and a capture is queued even without a scene change. This ensures that dense, slow-moving lectures (such as a presenter who rarely changes slides) still get representative screenshots throughout.
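One way to picture how scene changes and the max interval combine is the sketch below. It rests on several assumptions that the text does not confirm: the first scanned frame is always queued, any capture restarts the interval clock, and a forced capture fires once the interval has fully elapsed. All names are hypothetical.

```python
def candidate_timestamps(change_flags, scan_fps=2.0, max_interval=30.0):
    """Pick capture timestamps (in seconds) from per-frame scene-change flags.

    change_flags[i] is True when scanned frame i differs from frame i - 1.
    """
    candidates = []
    last_capture = None
    for index, changed in enumerate(change_flags):
        t = index / scan_fps
        # Assumption: a forced capture fires once max_interval has elapsed.
        overdue = last_capture is not None and t - last_capture >= max_interval
        # Assumption: the very first scanned frame is always a candidate.
        if last_capture is None or changed or overdue:
            candidates.append(t)
            last_capture = t  # any capture restarts the interval clock
    return candidates
```

For a 90-second static clip scanned at 2 fps with a single scene change at the 10-second mark, this yields candidates at 0, 10, 40, and 70 seconds.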

Phase 2 — Capture

Up to 4 AVAssetImageGenerator instances run concurrently to extract full-resolution frames (up to 2560×1440) at each candidate timestamp. Before saving, each frame passes through two quality gates:

Darkness check — Frames where average pixel brightness is below 5% (nearly black) are discarded. These are typically fade-in/fade-out transitions.
Blur check — A grid-based Variance of Laplacian analysis determines whether the frame is in focus.

Frames that pass both gates are saved as JPEG at 95% quality to the lecture's screenshots folder.
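The darkness check is simple enough to show directly. The sketch below assumes the frame is available as a flat list of 0-255 brightness values; on that scale, 5% corresponds to a mean of about 12.75. The function name is hypothetical and this is not Harvestry's code.

```python
def is_too_dark(pixels, min_brightness=0.05):
    """Return True when mean brightness is below min_brightness.

    pixels holds 0-255 brightness values; min_brightness is a fraction
    of full scale (0.05 means 5%).
    """
    if not pixels:
        raise ValueError("frame has no pixels")
    mean = sum(pixels) / len(pixels)
    return mean / 255 < min_brightness
```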

Scene Change Detection

The threshold values are tuned for lecture content:

Pixel difference threshold: 20/255 — Small colour variations from video compression noise are below this threshold and are ignored.
Changed pixel fraction: 20% — A new slide appearing, a speaker moving substantially, or a camera cut typically changes more than 20% of pixels.

The scan uses a batched generateCGImagesAsynchronously call — all frame requests are issued at once and results are delivered in pipeline order, which is significantly faster than sequential random-access seeks.

Blur Filtering

Harvestry uses a 3×2 grid of cells (3 columns, 2 rows) evaluated on a 320×180 thumbnail. Each cell is approximately 107×90 pixels. The blur score is computed using the Variance of Laplacian (VoL), a standard image sharpness metric.

Before the VoL is computed, each cell passes through two gates that exclude uninformative regions:

Darkness gate — Cells with average brightness below 15/255 are skipped. This prevents letterbox bars and dark regions from being scored as "blurry".
Featurelessness gate — Cells with very low texture variance (blank slide backgrounds, solid fills) are skipped. An empty white slide is not blurry, just featureless.

Any cell that passes both gates and has a VoL score below 250 causes the entire frame to be rejected as blurry. The grid approach prevents a sharp presenter webcam in the corner from rescuing a frame where the slide content is actually blurry — a problem that whole-frame VoL scoring cannot handle.
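The grid logic described above can be sketched as follows. Several details are assumptions made for illustration: the thumbnail is a list of rows of 0-255 brightness values, the Laplacian uses the common 4-neighbour kernel, and the featurelessness threshold (texture_gate) is an invented value because the text does not give one. Only the 3×2 grid, the 15/255 darkness gate, and the VoL threshold of 250 come from the description.

```python
def variance(values):
    """Population variance of a non-empty sequence of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def laplacian_variance(gray, x0, y0, x1, y1):
    """Variance of the 4-neighbour Laplacian over a cell's interior pixels."""
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(y0 + 1, y1 - 1)
        for x in range(x0 + 1, x1 - 1)
    ]
    return variance(responses)

def frame_is_blurry(gray, cols=3, rows=2, dark_gate=15,
                    texture_gate=25.0, vol_gate=250.0):
    """Return True if any scored grid cell falls below the VoL threshold."""
    height, width = len(gray), len(gray[0])
    for r in range(rows):
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            y0, y1 = r * height // rows, (r + 1) * height // rows
            cell = [v for row in gray[y0:y1] for v in row[x0:x1]]
            if sum(cell) / len(cell) < dark_gate:
                continue  # darkness gate: skip letterbox bars and dark regions
            if variance(cell) < texture_gate:
                continue  # featurelessness gate (threshold is an assumption)
            if laplacian_variance(gray, x0, y0, x1, y1) < vol_gate:
                return True  # one soft cell rejects the whole frame
    return False
```

Because the function returns as soon as one scored cell is soft, a frame with a crisp region in one corner and a smooth, low-detail gradient everywhere else is still rejected, which is the behaviour the grid approach is meant to provide.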

ℹ️

Strict vs. relaxed blur mode. Automated capture uses strict mode (all grid cells evaluated). The Seek to Clear player feature uses a relaxed mode (single-pass whole-frame VoL) — this stops at the first frame that looks acceptably sharp to a human viewer, rather than enforcing the tighter grid requirement used during automated batch capture.

Max Interval Setting

The Max Interval setting in Settings → Screenshots controls the maximum time that can elapse between consecutive screenshots, regardless of scene activity.

Range: 15–120 seconds
Default: 30 seconds

At 30 seconds, a one-hour lecture will always produce at least 120 capture candidates, even if the slides change very infrequently; candidates that fail the darkness or blur checks are still discarded. Reducing the interval below 30 seconds increases the screenshot count; raising it above 30 seconds reduces it.
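The figure of 120 is the lecture duration divided by the interval (3,600 s ÷ 30 s). A small helper makes it easy to check other settings; the function name is hypothetical, and the result counts forced candidates only, before any quality filtering or additional scene-change captures.

```python
def min_capture_candidates(duration_s, max_interval_s=30):
    """Lower bound on capture candidates forced by the max interval."""
    if not 15 <= max_interval_s <= 120:
        raise ValueError("max interval must be between 15 and 120 seconds")
    return int(duration_s // max_interval_s)
```

For a one-hour lecture this gives 240 candidates at the 15-second minimum and 30 at the 120-second maximum.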

For very fast-paced content (code walkthroughs with frequent edits), reduce the interval. For slow slide decks with dense text, the default 30 seconds is usually appropriate.

Manual Frame Capture

After processing, you can add screenshots manually from the in-app player.

1. Navigate to the moment you want

Use the scrubber below the video player, or click on a transcript segment to jump to that timestamp.

2. Click "Add to Transcript"

The Add to Transcript button is in the scrubber bar below the player. Click it to capture the current frame at full resolution.

3. The screenshot is inserted in order

The new screenshot is saved to disk and inserted into the transcript panel in timestamp order. A "Modified, re-export needed" badge appears; click Reexport to regenerate the HTML with the new screenshot included.

Seek to Clear

The ◀ Seek to Clear and Seek to Clear ▶ buttons below the video player step through the video in 0.5-second increments — up to 20 seconds in either direction — and stop at the first frame that passes the blur check. This is useful when the video contains a brief sharp moment surrounded by motion blur (e.g., a camera panning to a new slide).

Seek to Clear uses the relaxed blur mode (single whole-frame VoL) rather than the strict grid mode. This means it may stop on frames that the automated pipeline would have rejected — but those frames are typically acceptable to human viewers.

While seeking, the button label shows a spinner. Seeking checks each frame live using AVAssetImageGenerator at 320×180 resolution for speed.
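The stepping behaviour can be sketched as a short loop. This is an illustration under stated assumptions rather than the app's code: is_sharp stands in for the relaxed whole-frame blur check, the search stops at the start or end of the video, and all names are hypothetical.

```python
def seek_to_clear(start, direction, is_sharp, duration,
                  step=0.5, max_distance=20.0):
    """Return the first timestamp that passes is_sharp, or None.

    direction is +1 (forward) or -1 (backward); is_sharp is a callable
    that takes a timestamp in seconds and returns True for a clear frame.
    """
    steps = int(max_distance / step)
    for i in range(1, steps + 1):
        t = start + direction * i * step
        if t < 0 or t > duration:
            break  # assumed behaviour: never seek outside the video
        if is_sharp(t):
            return t
    return None
```

For example, seeking forward from 10.0 s in a 60-second video whose only sharp frames lie between 12.0 s and 12.5 s stops at 12.0 s; if no frame within 20 seconds passes, the function returns None.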

AV1 and VP9 Videos

If yt-dlp downloads a video encoded in VP9 or AV1 — which happens when a site has no H.264 stream available — Harvestry cannot use AVAssetImageGenerator for random-access seeks. Instead, it automatically transcodes the video to H.264 using h264_videotoolbox (hardware-accelerated) before running the screenshot pipeline.

You will see a "Converting video format…" spinner on the lecture detail view while this happens. The H.264 transcode replaces the original file to save disk space — the AV1/VP9 original is deleted after a successful transcode.

For the in-app player, the same transcode is reused. Harvestry caches the transcode path so it doesn't need to redo the conversion on subsequent opens.
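As a rough illustration of the transcode step, the sketch below builds an ffmpeg command line that uses the h264_videotoolbox encoder. It is an assumption that Harvestry invokes ffmpeg this way; the audio settings, the -y overwrite flag, the codec names (as ffprobe would report them), and the function names are all hypothetical choices for the example.

```python
def needs_transcode(codec_name):
    """True for codecs the text says cannot be seeked directly."""
    return codec_name.lower() in {"vp9", "av1"}

def transcode_command(src, dst):
    """Build an ffmpeg argument list for a hardware H.264 transcode."""
    return [
        "ffmpeg", "-y",              # overwrite dst if it exists (assumed)
        "-i", src,
        "-c:v", "h264_videotoolbox",  # VideoToolbox hardware encoder
        "-c:a", "aac",               # assumed audio codec for the MP4 output
        dst,
    ]
```

The list can be passed to subprocess.run on a Mac with ffmpeg installed; it is returned rather than executed here so the example stays free of side effects.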

💡

Prefer H.264 at download time. Harvestry's yt-dlp format string already prefers [vcodec^=avc] to request H.264 streams. Transcoding only happens when the platform has no H.264 option at all.
