Transcription

On-device speech recognition using WhisperKit and the Apple Neural Engine. No audio ever leaves your Mac.

How Transcription Works

Harvestry transcribes video audio using WhisperKit, an optimized implementation of OpenAI's Whisper automatic speech recognition (ASR) model built for Apple Silicon. The model runs directly on the Apple Neural Engine (ANE) — the dedicated machine-learning accelerator present in every M-series chip.

Key properties of the transcription engine:

Fully on-device. No audio or text is sent anywhere. The ANE processes everything locally.
Word-level timestamps. WhisperKit returns timestamps for every individual word, not just sentence or segment boundaries. These timestamps power the per-word highlight sync in the exported audio player.
Auto language detection. Whisper detects the spoken language automatically. It performs best on content where a single language dominates.
Background processing. Transcription runs asynchronously and does not block the UI. You can continue using Harvestry while transcription runs.
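
To illustrate how word-level timestamps can drive a per-word highlight, here is a minimal sketch in Python. It is not Harvestry's or WhisperKit's code: the `Word` type, its field names, and `active_word_index` are hypothetical, and the timings are made-up sample values. The idea is simply a binary search over word start times for the current playback position.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    # Hypothetical shape for one transcribed word; times are in seconds.
    text: str
    start: float
    end: float


def active_word_index(words, t):
    """Return the index of the word being spoken at time t, or None in a gap."""
    starts = [w.start for w in words]   # words are assumed sorted by start time
    i = bisect_right(starts, t) - 1     # last word that started at or before t
    if i >= 0 and t < words[i].end:
        return i
    return None


# Made-up sample timings for illustration only.
words = [
    Word("No", 0.00, 0.18),
    Word("audio", 0.22, 0.55),
    Word("leaves", 0.58, 0.90),
]
```

A player would call `active_word_index(words, current_time)` on each time update and highlight the returned word, if any. A real player would likely precompute the list of start times once rather than rebuilding it on every call.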

ℹ️

Apple Silicon required. Harvestry needs a Mac with an M-series chip (M1 or later), because Intel Macs have no Apple Neural Engine. Intel Macs are not supported.

Whisper Models

Five model sizes are available. Choose based on how you want to balance accuracy against speed and disk space.

Model	Disk Size	Speed	Accuracy	Best For
Tiny	~75 MB	Fastest	Basic	Testing; very short clips; clearly spoken English where accuracy is less critical
Base	~150 MB	Very fast	Good	Good balance of size and quality for simple content; podcasts; meeting recordings
Small ★	~250 MB	Fast	Better	Recommended starting point; handles most lectures, accented speakers, and technical terms well
Medium	~770 MB	Moderate	High	Dense academic content; multiple speakers; stronger accents; non-English material
Large Turbo	~800 MB	Moderate	Best	Maximum accuracy; complex or highly technical lectures; situations where every word matters

All five models produce word-level timestamps. The accuracy difference is most noticeable with technical jargon, accented speech, and non-English languages. ★ Small is selected by default.
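
The size and accuracy trade-off in the table can be expressed as a small lookup. The sketch below is only an illustration: the `MODELS` list copies the approximate disk sizes from the table, and `most_accurate_within` is a hypothetical helper, not something Harvestry exposes.

```python
# (name, approximate disk size in MB), ordered from least to most accurate,
# following the table above.
MODELS = [
    ("Tiny", 75),
    ("Base", 150),
    ("Small", 250),
    ("Medium", 770),
    ("Large Turbo", 800),
]


def most_accurate_within(budget_mb):
    """Return the most accurate model that fits the disk budget, or None."""
    fitting = [name for name, size in MODELS if size <= budget_mb]
    return fitting[-1] if fitting else None
```

For example, with about 300 MB to spare this picks Small, which matches the recommended default.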

Downloading a Model

Models are not bundled with the app — they are downloaded on demand from Hugging Face and cached in ~/Library/Caches/. You must download at least one model before processing your first lecture.

Open Settings

Click the gear icon in the top-right toolbar, or use the menu bar: Harvestry → Settings.

Go to the Transcription tab

The first tab in Settings is Transcription. It shows the five available models and their download status.

Select a model and click Download

Click the model row to select it, then click Download. A progress bar shows how far the download has progressed. The download size ranges from roughly 75 MB to 800 MB, depending on the model.

Wait for completion

The model status changes to Ready with a green checkmark. You can close Settings and begin processing.

💡

Start with Small. At ~250 MB it downloads quickly, loads fast, and handles the vast majority of lectures well. If you need more accuracy — dense technical content, heavy accents, or non-English material — try Medium or Large Turbo and retranscribe. Tiny and Base are useful when you just need a quick result and do not require high accuracy.

Selecting a Model Per Lecture

Each lecture can be transcribed with any downloaded model, regardless of the global default set in Settings. On the lecture detail view, the Transcription step row shows the currently selected model in its subtitle. Click the model name to open a menu picker and select a different model before starting processing.

The model choice is stored per lecture and shown in the step row even after processing completes, so you can always see which model was used.
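
One way to picture the relationship between the per-lecture choice and the global default is the following sketch. It assumes, hypothetically, that a lecture without its own choice falls back to the global default; `resolve_model` and its parameters are illustrative names, not Harvestry's implementation.

```python
def resolve_model(lecture_choice, global_default, downloaded):
    """Pick the model to transcribe a lecture with.

    Assumption: a per-lecture choice, when set, takes priority over the
    global default. Either way the model must already be downloaded.
    """
    candidate = lecture_choice or global_default
    if candidate not in downloaded:
        raise ValueError(f"{candidate} is not downloaded yet")
    return candidate
```

With Small as the global default and both Small and Medium downloaded, `resolve_model("Medium", "Small", {"Small", "Medium"})` returns `"Medium"`, while `resolve_model(None, "Small", {"Small", "Medium"})` returns `"Small"`.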

Model Mismatch Warning

If you change the global default model in Settings after a lecture has already been transcribed, the step row for that lecture will appear in orange. This indicates a mismatch between the model that was used and the current global default — it does not mean anything is wrong, just that the transcript was generated with a different model than your current preference.

To clear the mismatch warning, either:

Change the per-lecture model selection back to match the transcript, or
Retranscribe the lecture with the new model (see below).

Retranscribing

You can re-run transcription on a completed lecture at any time, for example to use a more accurate model or to pick up improvements after updating WhisperKit.

Select the lecture in the sidebar.
In the detail view, click the ⋯ overflow menu in the toolbar.
Choose Retranscribe.
A confirmation dialog warns that the existing transcript will be replaced. Confirm to proceed.

Retranscription runs only steps 1 and 4 of the pipeline — your existing screenshots are preserved. The HTML export is regenerated after the new transcript is ready.

⚠️

Annotations may be lost on retranscription. Highlights and margin notes are anchored to specific transcript passages, so they may not survive if the passage text changes substantially. Export your current HTML as a backup before retranscribing important lectures.
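
To see why an annotation can fail to carry over, consider this sketch of fuzzy re-anchoring. How Harvestry actually anchors annotations is not described here, so treat this purely as an illustration: `reanchor` and the `min_ratio` threshold are hypothetical, and the matching uses Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def reanchor(passage, new_transcript, min_ratio=0.8):
    """Find the passage in a new transcript.

    Returns a (start, end) span of word indices for the best-matching window,
    or None when nothing is similar enough (the annotation would be lost).
    """
    words = new_transcript.split()
    n = len(passage.split())
    best_ratio, best_span = 0.0, None
    # Slide a window of the passage's word count across the new transcript.
    for i in range(0, max(1, len(words) - n + 1)):
        window = " ".join(words[i:i + n])
        ratio = SequenceMatcher(None, passage.lower(), window.lower()).ratio()
        if ratio > best_ratio:
            best_ratio, best_span = ratio, (i, i + n)
    return best_span if best_ratio >= min_ratio else None
```

If a different model transcribes the passage with nearly the same wording, a match above the threshold is found and the annotation can move to the new span. If the wording changes substantially, no window is similar enough and the function returns `None`.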

Updating WhisperKit Models

Harvestry checks for model updates from Hugging Face every 24 hours. When an update is available, the model row in Settings shows an update badge. Click Update to download the newer weights. The update check compares local snapshot SHAs against the Hugging Face Hub HEAD.
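
The update logic described above reduces to two small decisions, sketched below. The function names are hypothetical and fetching the remote SHA from the Hugging Face Hub is left out; only the 24-hour interval and the SHA comparison come from the description.

```python
from datetime import datetime, timedelta

CHECK_INTERVAL = timedelta(hours=24)


def check_is_due(last_checked, now):
    """True if no check has run yet or the last one was at least 24 hours ago."""
    return last_checked is None or now - last_checked >= CHECK_INTERVAL


def update_available(local_sha, remote_head_sha):
    """True when the locally cached snapshot differs from the Hub's HEAD."""
    return local_sha != remote_head_sha
```

Comparing SHAs rather than dates or version strings means any change to the published weights is detected, and an unchanged model never triggers a download.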
